2025-05-07T20:23:26.0708077Z Current runner version: '2.323.0' 2025-05-07T20:23:26.0715107Z Runner name: 'i-02a13dec7b575dc8f' 2025-05-07T20:23:26.0716051Z Machine name: 'ip-10-0-35-243' 2025-05-07T20:23:26.0718743Z ##[group]GITHUB_TOKEN Permissions 2025-05-07T20:23:26.0720958Z Contents: read 2025-05-07T20:23:26.0721475Z Metadata: read 2025-05-07T20:23:26.0721968Z Packages: read 2025-05-07T20:23:26.0722460Z ##[endgroup] 2025-05-07T20:23:26.0724347Z Secret source: None 2025-05-07T20:23:26.0724962Z Prepare workflow directory 2025-05-07T20:23:26.1680499Z Prepare all required actions 2025-05-07T20:23:26.1719103Z Getting action download info 2025-05-07T20:23:26.3967547Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683) 2025-05-07T20:23:26.6883205Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093) 2025-05-07T20:23:27.0606919Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187) 2025-05-07T20:23:28.7679890Z Getting action download info 2025-05-07T20:23:28.9051295Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482) 2025-05-07T20:23:29.0985662Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.12, 12.6.3, 12.6.3, clang) 2025-05-07T20:23:29.1492644Z A job started hook has been configured by the self-hosted runner administrator 2025-05-07T20:23:29.1601828Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh' 2025-05-07T20:23:29.1613210Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:29.1613946Z ##[endgroup] 2025-05-07T20:23:30.4260877Z Runner Type: linux.g5.4xlarge.nvidia.gpu 2025-05-07T20:23:30.4261307Z Instance Type: g5.4xlarge 2025-05-07T20:23:30.4261562Z AMI Name: unknown 2025-05-07T20:23:30.4299983Z AMI ID: ami-071226ecf16aa7d96 2025-05-07T20:23:35.8681726Z ##[group]Run actions/checkout@v4 2025-05-07T20:23:35.8682040Z with: 2025-05-07T20:23:35.8682286Z submodules: true 2025-05-07T20:23:35.8682538Z repository: pytorch/FBGEMM 2025-05-07T20:23:35.8682930Z token: *** 2025-05-07T20:23:35.8683149Z ssh-strict: true 2025-05-07T20:23:35.8683368Z ssh-user: git 2025-05-07T20:23:35.8683600Z persist-credentials: true 2025-05-07T20:23:35.8683857Z clean: true 2025-05-07T20:23:35.8684098Z sparse-checkout-cone-mode: true 2025-05-07T20:23:35.8684378Z fetch-depth: 1 2025-05-07T20:23:35.8684603Z fetch-tags: false 2025-05-07T20:23:35.8684830Z show-progress: true 2025-05-07T20:23:35.8685054Z lfs: false 2025-05-07T20:23:35.8685272Z set-safe-directory: true 2025-05-07T20:23:35.8685528Z env: 2025-05-07T20:23:35.8685756Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:35.8686088Z BUILD_ENV: build_binary 2025-05-07T20:23:35.8686375Z BUILD_TARGET: genai 2025-05-07T20:23:35.8686600Z BUILD_VARIANT: cuda 2025-05-07T20:23:35.8686873Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:35.8687130Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:35.8687366Z ##[endgroup] 2025-05-07T20:23:35.9869172Z Syncing repository: pytorch/FBGEMM 2025-05-07T20:23:35.9870388Z ##[group]Getting Git version info 2025-05-07T20:23:35.9870838Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM' 2025-05-07T20:23:35.9871763Z [command]/usr/bin/git version 2025-05-07T20:23:35.9872137Z git version 2.47.1 2025-05-07T20:23:35.9885520Z ##[endgroup] 2025-05-07T20:23:35.9907497Z Temporarily overriding 
HOME='/home/ec2-user/actions-runner/_work/_temp/3b9d1976-eeb5-46b3-aa74-005056000165' before making global git config changes 2025-05-07T20:23:35.9908696Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:23:35.9912324Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:35.9954137Z [command]/usr/bin/git config --local --get remote.origin.url 2025-05-07T20:23:35.9978662Z https://github.com/pytorch/FBGEMM 2025-05-07T20:23:35.9996368Z ##[group]Removing previously created refs, to avoid conflicts 2025-05-07T20:23:36.0001243Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD 2025-05-07T20:23:36.0028040Z refs/heads/main 2025-05-07T20:23:36.0036942Z [command]/usr/bin/git checkout --detach 2025-05-07T20:23:36.8887308Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079) 2025-05-07T20:23:36.8942136Z [command]/usr/bin/git branch --delete --force main 2025-05-07T20:23:36.8970312Z Deleted branch main (was b6b2ce3). 2025-05-07T20:23:36.8975878Z ##[endgroup] 2025-05-07T20:23:36.8979656Z [command]/usr/bin/git submodule status 2025-05-07T20:23:36.9403667Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b) 2025-05-07T20:23:36.9490582Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd) 2025-05-07T20:23:36.9580595Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec) 2025-05-07T20:23:36.9668877Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e) 2025-05-07T20:23:36.9756335Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77) 2025-05-07T20:23:36.9839795Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844) 2025-05-07T20:23:36.9921053Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280) 2025-05-07T20:23:36.9934641Z ##[group]Cleaning the repository 2025-05-07T20:23:36.9939698Z [command]/usr/bin/git clean -ffdx 2025-05-07T20:23:36.9996090Z [command]/usr/bin/git reset --hard HEAD 2025-05-07T20:23:37.0103725Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079) 2025-05-07T20:23:37.0110666Z ##[endgroup] 2025-05-07T20:23:37.0112311Z ##[group]Disabling automatic garbage collection 2025-05-07T20:23:37.0115653Z [command]/usr/bin/git config --local gc.auto 0 2025-05-07T20:23:37.0147899Z ##[endgroup] 2025-05-07T20:23:37.0148275Z ##[group]Setting up auth 2025-05-07T20:23:37.0153544Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:23:37.0184690Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:23:37.0515494Z Entering 'external/asmjit' 2025-05-07T20:23:37.0589088Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.0664095Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.0731318Z Entering 'external/cutlass' 2025-05-07T20:23:37.0804065Z Entering 'external/googletest' 2025-05-07T20:23:37.0870791Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.0936370Z Entering 'external/json' 2025-05-07T20:23:37.1026636Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:23:37.1060486Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config 
--local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:23:37.1396577Z Entering 'external/asmjit' 2025-05-07T20:23:37.1463962Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.1537440Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.1604637Z Entering 'external/cutlass' 2025-05-07T20:23:37.1682121Z Entering 'external/googletest' 2025-05-07T20:23:37.1749219Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.1816714Z Entering 'external/json' 2025-05-07T20:23:37.1905471Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-05-07T20:23:37.1958614Z ##[endgroup] 2025-05-07T20:23:37.1959157Z ##[group]Fetching the repository 2025-05-07T20:23:37.1966097Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge 2025-05-07T20:23:37.4260657Z From https://github.com/pytorch/FBGEMM 2025-05-07T20:23:37.4261321Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge 2025-05-07T20:23:37.4287626Z ##[endgroup] 2025-05-07T20:23:37.4288103Z ##[group]Determining the checkout info 2025-05-07T20:23:37.4290050Z ##[endgroup] 2025-05-07T20:23:37.4295631Z [command]/usr/bin/git sparse-checkout disable 2025-05-07T20:23:37.4348031Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-05-07T20:23:37.4377239Z ##[group]Checking out the ref 2025-05-07T20:23:37.4381890Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge 2025-05-07T20:23:37.4509382Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079) 2025-05-07T20:23:37.4512516Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4 2025-05-07T20:23:37.4521696Z ##[endgroup] 2025-05-07T20:23:37.4522094Z ##[group]Setting up auth for fetching submodules 2025-05-07T20:23:37.4528650Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-05-07T20:23:37.4579640Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-05-07T20:23:37.4609455Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-05-07T20:23:37.4640943Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-05-07T20:23:37.4669402Z ##[endgroup] 2025-05-07T20:23:37.4669945Z ##[group]Fetching submodules 2025-05-07T20:23:37.4673282Z [command]/usr/bin/git submodule sync 2025-05-07T20:23:37.5046934Z Synchronizing submodule url for 'external/asmjit' 2025-05-07T20:23:37.5047590Z Synchronizing submodule url for 'external/composable_kernel' 2025-05-07T20:23:37.5048137Z Synchronizing submodule url for 'external/cpuinfo' 2025-05-07T20:23:37.5048523Z Synchronizing submodule url for 'external/cutlass' 2025-05-07T20:23:37.5049207Z Synchronizing submodule url for 'external/googletest' 2025-05-07T20:23:37.5049633Z Synchronizing submodule url for 'external/hipify_torch' 2025-05-07T20:23:37.5050045Z Synchronizing submodule url for 'external/json' 2025-05-07T20:23:37.5062508Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1 2025-05-07T20:23:37.5499652Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32' 2025-05-07T20:23:37.5651531Z Submodule path 'external/composable_kernel': checked out 
'4a61bdd4bd4ed730e078aebc7c0fcf046ff29406' 2025-05-07T20:23:37.5753490Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-05-07T20:23:37.5924764Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3' 2025-05-07T20:23:37.6015003Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571' 2025-05-07T20:23:37.6100993Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0' 2025-05-07T20:23:37.6206791Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-05-07T20:23:37.6224188Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0 2025-05-07T20:23:37.6559732Z Entering 'external/asmjit' 2025-05-07T20:23:37.6591869Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.6624198Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.6656906Z Entering 'external/cutlass' 2025-05-07T20:23:37.6688609Z Entering 'external/googletest' 2025-05-07T20:23:37.6720422Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.6752556Z Entering 'external/json' 2025-05-07T20:23:37.6796526Z ##[endgroup] 2025-05-07T20:23:37.6796956Z ##[group]Persisting credentials for submodules 2025-05-07T20:23:37.6802275Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-05-07T20:23:37.7132256Z Entering 'external/asmjit' 2025-05-07T20:23:37.7178633Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7179307Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7222597Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.7266469Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7266805Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7316889Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.7361397Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7361765Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7403964Z Entering 'external/cutlass' 2025-05-07T20:23:37.7447197Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7447659Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7498771Z Entering 'external/googletest' 2025-05-07T20:23:37.7541981Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7542329Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7584389Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.7627051Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7627501Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7671140Z Entering 'external/json' 2025-05-07T20:23:37.7713567Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7714032Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7774499Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-05-07T20:23:37.8109913Z Entering 'external/asmjit' 2025-05-07T20:23:37.8172936Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url 2025-05-07T20:23:37.8175945Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.8238572Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url 2025-05-07T20:23:37.8241143Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.8302654Z 
file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url 2025-05-07T20:23:37.8304996Z Entering 'external/cutlass' 2025-05-07T20:23:37.8368667Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url 2025-05-07T20:23:37.8370596Z Entering 'external/googletest' 2025-05-07T20:23:37.8431383Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url 2025-05-07T20:23:37.8434132Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.8494373Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url 2025-05-07T20:23:37.8496956Z Entering 'external/json' 2025-05-07T20:23:37.8561837Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url 2025-05-07T20:23:37.8688040Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-05-07T20:23:37.9022811Z Entering 'external/asmjit' 2025-05-07T20:23:37.9054966Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.9089198Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.9122100Z Entering 'external/cutlass' 2025-05-07T20:23:37.9155028Z Entering 'external/googletest' 2025-05-07T20:23:37.9187239Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.9219687Z Entering 'external/json' 2025-05-07T20:23:37.9275972Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-05-07T20:23:37.9612497Z Entering 'external/asmjit' 2025-05-07T20:23:37.9644599Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.9675794Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.9707513Z Entering 'external/cutlass' 2025-05-07T20:23:37.9739341Z Entering 'external/googletest' 2025-05-07T20:23:37.9771631Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.9803000Z Entering 'external/json' 2025-05-07T20:23:37.9847149Z ##[endgroup] 2025-05-07T20:23:37.9890288Z [command]/usr/bin/git log -1 --format=%H 2025-05-07T20:23:37.9916910Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:38.0114361Z ##[group]Run actions/download-artifact@v4 2025-05-07T20:23:38.0114686Z with: 2025-05-07T20:23:38.0114933Z name: fbgemm_genai_x86_clang_py3.12_cu12.6.3.whl 2025-05-07T20:23:38.0115262Z merge-multiple: false 2025-05-07T20:23:38.0115520Z repository: pytorch/FBGEMM 2025-05-07T20:23:38.0115790Z run-id: 14891846252 2025-05-07T20:23:38.0116035Z env: 2025-05-07T20:23:38.0116257Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:38.0116551Z BUILD_ENV: build_binary 2025-05-07T20:23:38.0116797Z BUILD_TARGET: genai 2025-05-07T20:23:38.0117021Z BUILD_VARIANT: cuda 2025-05-07T20:23:38.0117260Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:38.0117515Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:38.0117759Z ##[endgroup] 2025-05-07T20:23:38.2480269Z Downloading single artifact 2025-05-07T20:23:38.3481990Z Preparing to download the following artifacts: 2025-05-07T20:23:38.3483042Z - fbgemm_genai_x86_clang_py3.12_cu12.6.3.whl (ID: 3081363158, Size: 12541158, Expected Digest: sha256:373c809c973bf06d642bb3f64051fc1f783379222e7abf42eee25d1e313140af) 2025-05-07T20:23:38.4401503Z Redirecting to blob download url: 
https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-3b1ce936-8478-5297-b5a2-3b87565d3f2f/artifacts/fad341bebf692e31111b4381039b81f54868bd1760453cbce0dfdec7454245cc.zip
2025-05-07T20:23:38.4402971Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.5160975Z (node:65567) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.5161964Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.6979904Z SHA256 digest of downloaded artifact is 373c809c973bf06d642bb3f64051fc1f783379222e7abf42eee25d1e313140af
2025-05-07T20:23:38.6980507Z Artifact download completed successfully.
2025-05-07T20:23:38.6980874Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.6986434Z Download artifact has finished successfully
2025-05-07T20:23:38.7241752Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.7242149Z with:
2025-05-07T20:23:38.7242368Z   driver-version: 570.133.07
2025-05-07T20:23:38.7242617Z env:
2025-05-07T20:23:38.7242849Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.7243150Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.7243392Z   BUILD_TARGET: genai
2025-05-07T20:23:38.7243624Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.7243856Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.7244113Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.7244351Z ##[endgroup]
2025-05-07T20:23:38.7341004Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.7341412Z with:
2025-05-07T20:23:38.7341631Z   timeout_minutes: 10
2025-05-07T20:23:38.7341876Z   max_attempts: 3
2025-05-07T20:23:38.7366537Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install the nvidia-docker2 package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU if there
          # is more than one, so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in the future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed, as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                 Off | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |     ERR!     Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:38.7391624Z   retry_wait_seconds: 10
2025-05-07T20:23:38.7391889Z   polling_interval_seconds: 1
2025-05-07T20:23:38.7392153Z   warning_on_retry: true
2025-05-07T20:23:38.7392406Z   continue_on_error: false
2025-05-07T20:23:38.7392649Z env:
2025-05-07T20:23:38.7392865Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.7393174Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.7393429Z   BUILD_TARGET: genai
2025-05-07T20:23:38.7393652Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.7393900Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.7394164Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.7394406Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.7412360Z ##[endgroup]
2025-05-07T20:23:38.8252485Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.8254172Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.8254668Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:39.1704686Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:39.1705343Z No packages marked for removal.
2025-05-07T20:23:39.1769545Z Dependencies resolved.
2025-05-07T20:23:39.1779691Z Nothing to do.
2025-05-07T20:23:39.1780396Z Complete!
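The nick-fields/retry step above re-runs the whole installer command whenever an attempt fails or exceeds its timeout. A minimal bash sketch of that policy (max_attempts: 3, timeout_minutes: 10, retry_wait_seconds: 10), assuming GNU coreutils `timeout` is available; retry_with_timeout and the script path in the usage line are hypothetical, not part of this workflow:

    # Sketch of the retry policy that nick-fields/retry applies above
    # (retry_with_timeout is a hypothetical helper, not part of this workflow).
    retry_with_timeout() {
      local max_attempts=3 timeout_s=600 wait_s=10 attempt
      for attempt in $(seq 1 "$max_attempts"); do
        # GNU coreutils `timeout` kills the command once timeout_s elapses
        timeout "$timeout_s" bash -c "$1" && return 0
        echo "Attempt $attempt failed; retrying in $wait_s seconds" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
          sleep "$wait_s"
        fi
      done
      return 1
    }
    # Usage: retry_with_timeout 'bash /tmp/install_nvidia_driver.sh'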
2025-05-07T20:23:39.2611021Z + install_nvidia_driver_common 2025-05-07T20:23:39.2614880Z + echo 'Before installing NVIDIA driver' 2025-05-07T20:23:39.2615419Z + lspci 2025-05-07T20:23:39.2617414Z Before installing NVIDIA driver 2025-05-07T20:23:39.2796820Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:39.2798239Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:39.2799274Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:39.2800460Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-05-07T20:23:39.2801538Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-05-07T20:23:39.2802493Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:39.2803375Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:39.2804251Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2025-05-07T20:23:39.2804983Z + lsmod 2025-05-07T20:23:39.2851462Z Module Size Used by 2025-05-07T20:23:39.2852067Z xt_nat 16384 0 2025-05-07T20:23:39.2852590Z nvidia_modeset 1716224 0 2025-05-07T20:23:39.2853132Z video 65536 1 nvidia_modeset 2025-05-07T20:23:39.2853739Z wmi 36864 1 video 2025-05-07T20:23:39.2854271Z nvidia_uvm 1884160 0 2025-05-07T20:23:39.2855026Z nvidia 11583488 2 nvidia_uvm,nvidia_modeset 2025-05-07T20:23:39.2855669Z drm 602112 1 nvidia 2025-05-07T20:23:39.2856269Z drm_panel_orientation_quirks 32768 1 drm 2025-05-07T20:23:39.2856927Z backlight 24576 3 video,drm,nvidia_modeset 2025-05-07T20:23:39.2857318Z i2c_core 110592 2 nvidia,drm 2025-05-07T20:23:39.2857605Z veth 36864 0 2025-05-07T20:23:39.2857858Z xt_conntrack 16384 1 2025-05-07T20:23:39.2858111Z nft_chain_nat 16384 3 2025-05-07T20:23:39.2858371Z xt_MASQUERADE 20480 1 2025-05-07T20:23:39.2858680Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE 2025-05-07T20:23:39.2859016Z nf_conntrack_netlink 57344 0 2025-05-07T20:23:39.2859644Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-05-07T20:23:39.2860103Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-05-07T20:23:39.2860420Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-05-07T20:23:39.2860707Z xfrm_user 57344 1 2025-05-07T20:23:39.2860974Z xfrm_algo 16384 1 xfrm_user 2025-05-07T20:23:39.2861265Z xt_addrtype 16384 2 2025-05-07T20:23:39.2861520Z nft_compat 20480 4 2025-05-07T20:23:39.2861824Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-05-07T20:23:39.2862241Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-05-07T20:23:39.2862618Z br_netfilter 36864 0 2025-05-07T20:23:39.2862898Z bridge 323584 1 br_netfilter 2025-05-07T20:23:39.2863199Z stp 16384 1 bridge 2025-05-07T20:23:39.2863490Z llc 16384 2 bridge,stp 2025-05-07T20:23:39.2863776Z overlay 167936 0 2025-05-07T20:23:39.2864029Z tls 135168 0 2025-05-07T20:23:39.2864283Z nls_ascii 16384 1 2025-05-07T20:23:39.2864529Z nls_cp437 20480 1 2025-05-07T20:23:39.2864778Z vfat 24576 1 2025-05-07T20:23:39.2865035Z fat 86016 1 vfat 2025-05-07T20:23:39.2865304Z ena 180224 0 2025-05-07T20:23:39.2865550Z i8042 45056 0 2025-05-07T20:23:39.2865802Z serio 28672 3 i8042 2025-05-07T20:23:39.2866066Z button 24576 0 2025-05-07T20:23:39.2866323Z ghash_clmulni_intel 16384 0 2025-05-07T20:23:39.2866583Z sunrpc 696320 1 2025-05-07T20:23:39.2866830Z sch_fq_codel 20480 17 2025-05-07T20:23:39.2867087Z dm_mod 188416 0 2025-05-07T20:23:39.2867331Z fuse 163840 1 
2025-05-07T20:23:39.2867568Z loop 36864 0 2025-05-07T20:23:39.2867816Z configfs 57344 1 2025-05-07T20:23:39.2868225Z dax 45056 1 dm_mod 2025-05-07T20:23:39.2868540Z dmi_sysfs 20480 0 2025-05-07T20:23:39.2868939Z crc32_pclmul 16384 0 2025-05-07T20:23:39.2869189Z crc32c_intel 24576 0 2025-05-07T20:23:39.2869439Z efivarfs 24576 1 2025-05-07T20:23:39.2869696Z + modinfo nvidia 2025-05-07T20:23:39.2872378Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-05-07T20:23:39.2872871Z import_ns: DMA_BUF 2025-05-07T20:23:39.2873122Z alias: char-major-195-* 2025-05-07T20:23:39.2873392Z version: 570.133.07 2025-05-07T20:23:39.2873632Z supported: external 2025-05-07T20:23:39.2873880Z license: Dual MIT/GPL 2025-05-07T20:23:39.2874169Z firmware: nvidia/570.133.07/gsp_tu10x.bin 2025-05-07T20:23:39.2874504Z firmware: nvidia/570.133.07/gsp_ga10x.bin 2025-05-07T20:23:39.2874828Z srcversion: 49515739FD8F721A3F2F714 2025-05-07T20:23:39.2875161Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2025-05-07T20:23:39.2875505Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-05-07T20:23:39.2875855Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-05-07T20:23:39.2876166Z depends: i2c-core,drm 2025-05-07T20:23:39.2876462Z retpoline: Y 2025-05-07T20:23:39.2876679Z name: nvidia 2025-05-07T20:23:39.2877039Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-05-07T20:23:39.2877524Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-05-07T20:23:39.2877969Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-05-07T20:23:39.2878391Z parm: NVreg_ResmanDebugLevel:int 2025-05-07T20:23:39.2878702Z parm: NVreg_RmLogonRC:int 2025-05-07T20:23:39.2878996Z parm: NVreg_ModifyDeviceFiles:int 2025-05-07T20:23:39.2879314Z parm: NVreg_DeviceFileUID:int 2025-05-07T20:23:39.2879619Z parm: NVreg_DeviceFileGID:int 2025-05-07T20:23:39.2880031Z parm: NVreg_DeviceFileMode:int 2025-05-07T20:23:39.2880396Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-05-07T20:23:39.2880785Z parm: NVreg_UsePageAttributeTable:int 2025-05-07T20:23:39.2881120Z parm: NVreg_EnablePCIeGen3:int 2025-05-07T20:23:39.2881415Z parm: NVreg_EnableMSI:int 2025-05-07T20:23:39.2881728Z parm: NVreg_EnableStreamMemOPs:int 2025-05-07T20:23:39.2882090Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-05-07T20:23:39.2882488Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-05-07T20:23:39.2882869Z parm: NVreg_EnableS0ixPowerManagement:int 2025-05-07T20:23:39.2883284Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-05-07T20:23:39.2883684Z parm: NVreg_DynamicPowerManagement:int 2025-05-07T20:23:39.2884105Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-05-07T20:23:39.2884517Z parm: NVreg_EnableGpuFirmware:int 2025-05-07T20:23:39.2884860Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-05-07T20:23:39.2885231Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-05-07T20:23:39.2885605Z parm: NVreg_EnableUserNUMAManagement:int 2025-05-07T20:23:39.2885947Z parm: NVreg_MemoryPoolSize:int 2025-05-07T20:23:39.2886261Z parm: NVreg_KMallocHeapMaxSize:int 2025-05-07T20:23:39.2886592Z parm: NVreg_VMallocHeapMaxSize:int 2025-05-07T20:23:39.2886916Z parm: NVreg_IgnoreMMIOCheck:int 2025-05-07T20:23:39.2887227Z parm: NVreg_NvLinkDisable:int 2025-05-07T20:23:39.2887574Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-05-07T20:23:39.2887938Z parm: NVreg_RegisterPCIDriver:int 2025-05-07T20:23:39.2888269Z parm: NVreg_EnableResizableBar:int 2025-05-07T20:23:39.2888595Z parm: 
NVreg_EnableDbgBreakpoint:int 2025-05-07T20:23:39.2888938Z parm: NVreg_EnableNonblockingOpen:int 2025-05-07T20:23:39.2889276Z parm: NVreg_RegistryDwords:charp 2025-05-07T20:23:39.2889615Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-05-07T20:23:39.2890044Z parm: NVreg_RmMsg:charp 2025-05-07T20:23:39.2890335Z parm: NVreg_GpuBlacklist:charp 2025-05-07T20:23:39.2890652Z parm: NVreg_TemporaryFilePath:charp 2025-05-07T20:23:39.2890981Z parm: NVreg_ExcludedGpus:charp 2025-05-07T20:23:39.2891297Z parm: NVreg_DmaRemapPeerMmio:int 2025-05-07T20:23:39.2891623Z parm: NVreg_RmNvlinkBandwidth:charp 2025-05-07T20:23:39.2891985Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-05-07T20:23:39.2892343Z parm: NVreg_ImexChannelCount:int 2025-05-07T20:23:39.2892674Z parm: NVreg_CreateImexChannel0:int 2025-05-07T20:23:39.2893029Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-05-07T20:23:39.2893378Z parm: rm_firmware_active:charp 2025-05-07T20:23:39.2893674Z + HAS_NVIDIA_DRIVER=0 2025-05-07T20:23:39.2893912Z ++ command -v nvidia-smi 2025-05-07T20:23:39.2894181Z + '[' -x /usr/bin/nvidia-smi ']' 2025-05-07T20:23:39.2894546Z + set +e 2025-05-07T20:23:39.2894864Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 2025-05-07T20:23:40.9695526Z + INSTALLED_DRIVER_VERSION=570.133.07 2025-05-07T20:23:40.9695976Z + NVIDIA_SMI_STATUS=0 2025-05-07T20:23:40.9696300Z + '[' 0 -ne 0 ']' 2025-05-07T20:23:40.9696542Z + '[' 570.133.07 '!=' 570.133.07 ']' 2025-05-07T20:23:40.9696806Z + HAS_NVIDIA_DRIVER=1 2025-05-07T20:23:40.9697382Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation' 2025-05-07T20:23:40.9698054Z + set -e 2025-05-07T20:23:40.9698316Z + '[' 1 -eq 0 ']' 2025-05-07T20:23:40.9698702Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation 2025-05-07T20:23:40.9699175Z + post_install_nvidia_driver_common 2025-05-07T20:23:40.9701467Z + sudo modprobe nvidia 2025-05-07T20:23:41.0724250Z + echo 'After installing NVIDIA driver' 2025-05-07T20:23:41.0725299Z + lspci 2025-05-07T20:23:41.0725772Z After installing NVIDIA driver 2025-05-07T20:23:41.0843138Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:41.0843659Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:41.0844218Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:41.0844749Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-05-07T20:23:41.0845229Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-05-07T20:23:41.0845765Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:41.0846256Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:41.0846738Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. 
NVMe SSD Controller 2025-05-07T20:23:41.0847139Z + lsmod 2025-05-07T20:23:41.0875546Z Module Size Used by 2025-05-07T20:23:41.0875880Z xt_nat 16384 0 2025-05-07T20:23:41.0876150Z nvidia_modeset 1716224 0 2025-05-07T20:23:41.0876433Z video 65536 1 nvidia_modeset 2025-05-07T20:23:41.0876739Z wmi 36864 1 video 2025-05-07T20:23:41.0877016Z nvidia_uvm 1884160 0 2025-05-07T20:23:41.0877312Z nvidia 11583488 2 nvidia_uvm,nvidia_modeset 2025-05-07T20:23:41.0877644Z drm 602112 1 nvidia 2025-05-07T20:23:41.0877948Z drm_panel_orientation_quirks 32768 1 drm 2025-05-07T20:23:41.0878305Z backlight 24576 3 video,drm,nvidia_modeset 2025-05-07T20:23:41.0878656Z i2c_core 110592 2 nvidia,drm 2025-05-07T20:23:41.0878943Z veth 36864 0 2025-05-07T20:23:41.0879195Z xt_conntrack 16384 1 2025-05-07T20:23:41.0879456Z nft_chain_nat 16384 3 2025-05-07T20:23:41.0879718Z xt_MASQUERADE 20480 1 2025-05-07T20:23:41.0880030Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE 2025-05-07T20:23:41.0880376Z nf_conntrack_netlink 57344 0 2025-05-07T20:23:41.0881026Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-05-07T20:23:41.0881498Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-05-07T20:23:41.0881804Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-05-07T20:23:41.0882102Z xfrm_user 57344 1 2025-05-07T20:23:41.0882371Z xfrm_algo 16384 1 xfrm_user 2025-05-07T20:23:41.0882654Z xt_addrtype 16384 2 2025-05-07T20:23:41.0882917Z nft_compat 20480 4 2025-05-07T20:23:41.0883227Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-05-07T20:23:41.0883646Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-05-07T20:23:41.0884015Z br_netfilter 36864 0 2025-05-07T20:23:41.0884298Z bridge 323584 1 br_netfilter 2025-05-07T20:23:41.0884603Z stp 16384 1 bridge 2025-05-07T20:23:41.0884889Z llc 16384 2 bridge,stp 2025-05-07T20:23:41.0885191Z overlay 167936 0 2025-05-07T20:23:41.0885461Z tls 135168 0 2025-05-07T20:23:41.0885712Z nls_ascii 16384 1 2025-05-07T20:23:41.0885987Z nls_cp437 20480 1 2025-05-07T20:23:41.0886245Z vfat 24576 1 2025-05-07T20:23:41.0886498Z fat 86016 1 vfat 2025-05-07T20:23:41.0886780Z ena 180224 0 2025-05-07T20:23:41.0887033Z i8042 45056 0 2025-05-07T20:23:41.0887284Z serio 28672 3 i8042 2025-05-07T20:23:41.0887573Z button 24576 0 2025-05-07T20:23:41.0887844Z ghash_clmulni_intel 16384 0 2025-05-07T20:23:41.0888119Z sunrpc 696320 1 2025-05-07T20:23:41.0888378Z sch_fq_codel 20480 17 2025-05-07T20:23:41.0888653Z dm_mod 188416 0 2025-05-07T20:23:41.0888915Z fuse 163840 1 2025-05-07T20:23:41.0889178Z loop 36864 0 2025-05-07T20:23:41.0889620Z configfs 57344 1 2025-05-07T20:23:41.0889893Z dax 45056 1 dm_mod 2025-05-07T20:23:41.0890179Z dmi_sysfs 20480 0 2025-05-07T20:23:41.0890445Z crc32_pclmul 16384 0 2025-05-07T20:23:41.0890709Z crc32c_intel 24576 0 2025-05-07T20:23:41.0890957Z efivarfs 24576 1 2025-05-07T20:23:41.0891201Z + modinfo nvidia 2025-05-07T20:23:41.0892160Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-05-07T20:23:41.0892619Z import_ns: DMA_BUF 2025-05-07T20:23:41.0892859Z alias: char-major-195-* 2025-05-07T20:23:41.0893147Z version: 570.133.07 2025-05-07T20:23:41.0893394Z supported: external 2025-05-07T20:23:41.0893637Z license: Dual MIT/GPL 2025-05-07T20:23:41.0893921Z firmware: nvidia/570.133.07/gsp_tu10x.bin 2025-05-07T20:23:41.0894264Z firmware: nvidia/570.133.07/gsp_ga10x.bin 2025-05-07T20:23:41.0894736Z srcversion: 49515739FD8F721A3F2F714 2025-05-07T20:23:41.0895060Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 
2025-05-07T20:23:41.0895409Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-05-07T20:23:41.0895752Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-05-07T20:23:41.0896060Z depends: i2c-core,drm 2025-05-07T20:23:41.0896314Z retpoline: Y 2025-05-07T20:23:41.0896534Z name: nvidia 2025-05-07T20:23:41.0896889Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-05-07T20:23:41.0897416Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-05-07T20:23:41.0897866Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-05-07T20:23:41.0898287Z parm: NVreg_ResmanDebugLevel:int 2025-05-07T20:23:41.0898592Z parm: NVreg_RmLogonRC:int 2025-05-07T20:23:41.0898901Z parm: NVreg_ModifyDeviceFiles:int 2025-05-07T20:23:41.0899220Z parm: NVreg_DeviceFileUID:int 2025-05-07T20:23:41.0899528Z parm: NVreg_DeviceFileGID:int 2025-05-07T20:23:41.0899835Z parm: NVreg_DeviceFileMode:int 2025-05-07T20:23:41.0900313Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-05-07T20:23:41.0900702Z parm: NVreg_UsePageAttributeTable:int 2025-05-07T20:23:41.0901043Z parm: NVreg_EnablePCIeGen3:int 2025-05-07T20:23:41.0901350Z parm: NVreg_EnableMSI:int 2025-05-07T20:23:41.0901651Z parm: NVreg_EnableStreamMemOPs:int 2025-05-07T20:23:41.0902020Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-05-07T20:23:41.0902428Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-05-07T20:23:41.0902823Z parm: NVreg_EnableS0ixPowerManagement:int 2025-05-07T20:23:41.0903237Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-05-07T20:23:41.0903647Z parm: NVreg_DynamicPowerManagement:int 2025-05-07T20:23:41.0904076Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-05-07T20:23:41.0904493Z parm: NVreg_EnableGpuFirmware:int 2025-05-07T20:23:41.0904837Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-05-07T20:23:41.0905208Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-05-07T20:23:41.0905578Z parm: NVreg_EnableUserNUMAManagement:int 2025-05-07T20:23:41.0905922Z parm: NVreg_MemoryPoolSize:int 2025-05-07T20:23:41.0906247Z parm: NVreg_KMallocHeapMaxSize:int 2025-05-07T20:23:41.0906582Z parm: NVreg_VMallocHeapMaxSize:int 2025-05-07T20:23:41.0906898Z parm: NVreg_IgnoreMMIOCheck:int 2025-05-07T20:23:41.0907218Z parm: NVreg_NvLinkDisable:int 2025-05-07T20:23:41.0907574Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-05-07T20:23:41.0907938Z parm: NVreg_RegisterPCIDriver:int 2025-05-07T20:23:41.0908274Z parm: NVreg_EnableResizableBar:int 2025-05-07T20:23:41.0908627Z parm: NVreg_EnableDbgBreakpoint:int 2025-05-07T20:23:41.0908973Z parm: NVreg_EnableNonblockingOpen:int 2025-05-07T20:23:41.0909421Z parm: NVreg_RegistryDwords:charp 2025-05-07T20:23:41.0909780Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-05-07T20:23:41.0910126Z parm: NVreg_RmMsg:charp 2025-05-07T20:23:41.0910412Z parm: NVreg_GpuBlacklist:charp 2025-05-07T20:23:41.0910748Z parm: NVreg_TemporaryFilePath:charp 2025-05-07T20:23:41.0911085Z parm: NVreg_ExcludedGpus:charp 2025-05-07T20:23:41.0911403Z parm: NVreg_DmaRemapPeerMmio:int 2025-05-07T20:23:41.0911745Z parm: NVreg_RmNvlinkBandwidth:charp 2025-05-07T20:23:41.0912111Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-05-07T20:23:41.0912465Z parm: NVreg_ImexChannelCount:int 2025-05-07T20:23:41.0912803Z parm: NVreg_CreateImexChannel0:int 2025-05-07T20:23:41.0913168Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-05-07T20:23:41.0913509Z parm: rm_firmware_active:charp 2025-05-07T20:23:41.0913803Z + set +e 2025-05-07T20:23:41.0914016Z + nvidia-smi 2025-05-07T20:23:42.5074936Z Wed 
May  7 20:23:42 2025
2025-05-07T20:23:42.5075383Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.5075894Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:42.5076397Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.5076900Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:42.5077433Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:42.5077881Z |                                         |                        |               MIG M. |
2025-05-07T20:23:42.5078218Z |=========================================+========================+======================|
2025-05-07T20:23:42.5138935Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:42.5139923Z |  0%   30C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:42.5140365Z |                                         |                        |                  N/A |
2025-05-07T20:23:42.5140812Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.5141265Z 
2025-05-07T20:23:42.5141713Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.5142203Z | Processes:                                                                              |
2025-05-07T20:23:42.5142708Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:42.5143173Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:42.5143571Z |=========================================================================================|
2025-05-07T20:23:42.5144066Z |  No running processes found                                                             |
2025-05-07T20:23:42.5144552Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.9201469Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:44.3326182Z NVIDIA A10G
2025-05-07T20:23:44.6034361Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:44.6034723Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:44.6035073Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:44.6035481Z + set -e
2025-05-07T20:23:44.6035778Z INFO: Ignoring allowed status 0
2025-05-07T20:23:44.6042785Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:44.6047193Z + sudo yum install -y yum-utils
2025-05-07T20:23:45.0258286Z Last metadata expiration check: 0:17:46 ago on Wed May  7 20:05:59 2025.
2025-05-07T20:23:45.0507342Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:45.0909964Z Dependencies resolved.
2025-05-07T20:23:45.1092439Z Nothing to do.
2025-05-07T20:23:45.1092896Z Complete!
2025-05-07T20:23:45.1492622Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:45.1493200Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.1494093Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.5029162Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.5583619Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:46.1055726Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:46.1305548Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:46.1712812Z Dependencies resolved.
2025-05-07T20:23:46.1895115Z ================================================================================
2025-05-07T20:23:46.1895585Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:46.1895983Z ================================================================================
2025-05-07T20:23:46.1896315Z Downgrading:
2025-05-07T20:23:46.1896709Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:46.1897317Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:46.1897690Z 
2025-05-07T20:23:46.1897788Z Transaction Summary
2025-05-07T20:23:46.1898053Z ================================================================================
2025-05-07T20:23:46.1898373Z Downgrade  2 Packages
2025-05-07T20:23:46.1898528Z 
2025-05-07T20:23:46.1898632Z Total download size: 6.8 M
2025-05-07T20:23:46.1899099Z Downloading Packages:
2025-05-07T20:23:46.2587912Z (1/2): nvidia-container-toolkit-base-1.16.2-1.x  83 MB/s | 5.6 MB     00:00
2025-05-07T20:23:46.2670214Z (2/2): nvidia-container-toolkit-1.16.2-1.x86_64  16 MB/s | 1.2 MB     00:00
2025-05-07T20:23:46.2679675Z --------------------------------------------------------------------------------
2025-05-07T20:23:46.2682631Z Total                                            88 MB/s | 6.8 MB     00:00
2025-05-07T20:23:46.2685078Z Running transaction check
2025-05-07T20:23:46.2788687Z Transaction check succeeded.
2025-05-07T20:23:46.2789105Z Running transaction test
2025-05-07T20:23:46.3082101Z Transaction test succeeded.
2025-05-07T20:23:46.3084568Z Running transaction
2025-05-07T20:23:46.8572998Z   Preparing        :                                                        1/1
2025-05-07T20:23:46.9645361Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:46.9682100Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:46.9906532Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:46.9907375Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.0016140Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.0046196Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:47.1760339Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:47.1761154Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:47.1761808Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:47.1762359Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:47.3068238Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
================================================================================
2025-05-07T20:23:47.3069290Z WARNING:
2025-05-07T20:23:47.3069534Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:47.3069784Z 
2025-05-07T20:23:47.3069881Z   Available Versions:
2025-05-07T20:23:47.3070027Z 
2025-05-07T20:23:47.3070117Z   Version 2023.7.20250331:
2025-05-07T20:23:47.3070433Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:47.3070691Z 
2025-05-07T20:23:47.3070817Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:47.3071031Z 
2025-05-07T20:23:47.3071126Z     Release notes:
2025-05-07T20:23:47.3071537Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:47.3071926Z 
2025-05-07T20:23:47.3072020Z   Version 2023.7.20250414:
2025-05-07T20:23:47.3072328Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:47.3072581Z 
2025-05-07T20:23:47.3072696Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:47.3072919Z 
2025-05-07T20:23:47.3073012Z     Release notes:
2025-05-07T20:23:47.3073426Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:47.3073806Z 
2025-05-07T20:23:47.3073902Z   Version 2023.7.20250428:
2025-05-07T20:23:47.3074209Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:47.3074470Z 
2025-05-07T20:23:47.3074586Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:47.3074799Z 
2025-05-07T20:23:47.3074895Z     Release notes:
2025-05-07T20:23:47.3075293Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:47.3075676Z 
2025-05-07T20:23:47.3075787Z ================================================================================
2025-05-07T20:23:47.3431157Z 
2025-05-07T20:23:47.3431377Z 
2025-05-07T20:23:47.3431503Z Downgraded:
2025-05-07T20:23:47.3431892Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:47.3432477Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:47.3432861Z 
2025-05-07T20:23:47.3432951Z Complete!
2025-05-07T20:23:47.3903268Z + sudo systemctl restart docker
2025-05-07T20:23:52.2823744Z Wed May  7 20:23:52 2025
2025-05-07T20:23:52.2824265Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.2824869Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:52.2825372Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:52.2826123Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:52.2826673Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:52.2827125Z |                                         |                        |               MIG M. |
2025-05-07T20:23:52.2827471Z |=========================================+========================+======================|
2025-05-07T20:23:52.2906970Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:52.2907595Z |  0%   30C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:52.2908142Z |                                         |                        |                  N/A |
2025-05-07T20:23:52.2908695Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:52.2909200Z 
2025-05-07T20:23:52.2909597Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.2910041Z | Processes:                                                                              |
2025-05-07T20:23:52.2910498Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:52.2911267Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:52.2911666Z |=========================================================================================|
2025-05-07T20:23:52.2912283Z |  No running processes found                                                             |
2025-05-07T20:23:52.2912948Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.8019215Z Command completed after 1 attempt(s).
2025-05-07T20:23:52.8111624Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:52.8112094Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:52.8126943Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:52.8127317Z env:
2025-05-07T20:23:52.8127558Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:52.8127872Z   BUILD_ENV: build_binary
2025-05-07T20:23:52.8128131Z   BUILD_TARGET: genai
2025-05-07T20:23:52.8128385Z   BUILD_VARIANT: cuda
2025-05-07T20:23:52.8128625Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:52.8128891Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:52.8129209Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:52.8129551Z ##[endgroup]
2025-05-07T20:23:53.1524368Z ################################################################################
2025-05-07T20:23:53.1524758Z # Print System Info
2025-05-07T20:23:53.1524983Z #
2025-05-07T20:23:53.1541284Z # [2025-05-07T20:23:53.153Z] + print_system_info
2025-05-07T20:23:53.1541659Z ################################################################################
2025-05-07T20:23:53.1541885Z 
2025-05-07T20:23:53.1541998Z ################################################################################
2025-05-07T20:23:53.1542339Z [INFO] Printing environment variables ...
2025-05-07T20:23:53.1542646Z + printenv 2025-05-07T20:23:53.1542763Z 2025-05-07T20:23:53.1563791Z SHELL=/bin/bash 2025-05-07T20:23:53.1564174Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:53.1564579Z BUILD_VARIANT=cuda 2025-05-07T20:23:53.1565117Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1565710Z GITHUB_ACTION=__run 2025-05-07T20:23:53.1566001Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.1566341Z GITHUB_RUN_NUMBER=10601 2025-05-07T20:23:53.1566586Z RUNNER_NAME=i-02a13dec7b575dc8f 2025-05-07T20:23:53.1566878Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:23:53.1567185Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:23:53.1567513Z MACHINE_NAME_LC=x86_64 2025-05-07T20:23:53.1568036Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh 2025-05-07T20:23:53.1568642Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:23:53.1568971Z PRELUDE=.github/scripts/setup_env.bash 2025-05-07T20:23:53.1569273Z GITHUB_REF_TYPE=branch 2025-05-07T20:23:53.1569738Z *** 2025-05-07T20:23:53.1569991Z LOGNAME=ec2-user 2025-05-07T20:23:53.1570232Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:23:53.1570497Z ENFORCE_CUDA_DEVICE=1 2025-05-07T20:23:53.1570735Z GITHUB_ACTIONS=true 2025-05-07T20:23:53.1570962Z SYSTEMD_EXEC_PID=55382 2025-05-07T20:23:53.1571242Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:53.1571808Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge 2025-05-07T20:23:53.1572339Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:23:53.1572621Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:23:53.1572887Z RUNNER_OS=Linux 2025-05-07T20:23:53.1573116Z GITHUB_REF_PROTECTED=false 2025-05-07T20:23:53.1573362Z HOME=/home/ec2-user 2025-05-07T20:23:53.1573628Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:23:53.1573932Z LANG=C.UTF-8 2025-05-07T20:23:53.1574229Z RUNNER_TRACKING_ID=github_9125985d-0653-4ab0-94d0-9e9fb9cb14a2 2025-05-07T20:23:53.1574706Z RUNNER_ARCH=X64 2025-05-07T20:23:53.1574980Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp 2025-05-07T20:23:53.1575637Z BUILD_TARGET=genai 2025-05-07T20:23:53.1576179Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1577085Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1577861Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-05-07T20:23:53.1578550Z INVOCATION_ID=b55d6cb2507b4fe896b3815e87d2f4e7 2025-05-07T20:23:53.1578892Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:23:53.1579167Z GITHUB_RUN_ID=14891846252 2025-05-07T20:23:53.1579772Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1580415Z BUILD_ENV=build_binary 2025-05-07T20:23:53.1580656Z GITHUB_ACTOR=q10 2025-05-07T20:23:53.1580888Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:23:53.1581129Z KERN_NAME_LC=linux 2025-05-07T20:23:53.1581363Z BUILD_CUDA_VERSION=12.6.3 2025-05-07T20:23:53.1581675Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:23:53.1582030Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:23:53.1582290Z USER=ec2-user 2025-05-07T20:23:53.1582534Z GITHUB_SERVER_URL=https://github.com 
2025-05-07T20:23:53.1582824Z SHLVL=1 2025-05-07T20:23:53.1583032Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:53.1583356Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:53.1583840Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:53.1584281Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:53.1584526Z KERN_NAME=Linux 2025-05-07T20:23:53.1584761Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:53.1585171Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:53.1585607Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:53.1585883Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:53.1586120Z JOURNAL_STREAM=8:83617 2025-05-07T20:23:53.1586437Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:53.1586813Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:53.1587118Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:53.1587457Z GITHUB_BASE_REF=main 2025-05-07T20:23:53.1587674Z CI=true 2025-05-07T20:23:53.1587878Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:53.1588163Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:53.1588449Z GITHUB_ACTION_REF= 2025-05-07T20:23:53.1588691Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:53.1589317Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1589923Z MACHINE_NAME=x86_64 2025-05-07T20:23:53.1590142Z _=/usr/bin/printenv 2025-05-07T20:23:53.1590272Z 2025-05-07T20:23:53.1590386Z ################################################################################ 2025-05-07T20:23:53.1590708Z [INFO] Print ldd version ... 2025-05-07T20:23:53.1590974Z + ldd --version 2025-05-07T20:23:53.1591101Z 2025-05-07T20:23:53.1591198Z ldd (GNU libc) 2.34 2025-05-07T20:23:53.1591473Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:53.1591934Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:53.1592491Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:53.1592957Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:53.1593193Z 2025-05-07T20:23:53.1593312Z ################################################################################ 2025-05-07T20:23:53.1593635Z [INFO] Print CPU info ... 
2025-05-07T20:23:53.1593880Z + nproc 2025-05-07T20:23:53.1593997Z 2025-05-07T20:23:53.1611258Z 16 2025-05-07T20:23:53.1612892Z 2025-05-07T20:23:53.1613077Z + lscpu 2025-05-07T20:23:53.1613191Z 2025-05-07T20:23:53.1724611Z Architecture: x86_64 2025-05-07T20:23:53.1725374Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:53.1727105Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1727920Z Byte Order: Little Endian 2025-05-07T20:23:53.1728773Z CPU(s): 16 2025-05-07T20:23:53.1729431Z On-line CPU(s) list: 0-15 2025-05-07T20:23:53.1730067Z Vendor ID: AuthenticAMD 2025-05-07T20:23:53.1730473Z Model name: AMD EPYC 7R32 2025-05-07T20:23:53.1730814Z CPU family: 23 2025-05-07T20:23:53.1731440Z Model: 49 2025-05-07T20:23:53.1731748Z Thread(s) per core: 2 2025-05-07T20:23:53.1732049Z Core(s) per socket: 8 2025-05-07T20:23:53.1732344Z Socket(s): 1 2025-05-07T20:23:53.1732628Z Stepping: 0 2025-05-07T20:23:53.1732937Z BogoMIPS: 5599.29 2025-05-07T20:23:53.1735298Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1737551Z Hypervisor vendor: KVM 2025-05-07T20:23:53.1737876Z Virtualization type: full 2025-05-07T20:23:53.1738268Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:53.1738645Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:53.1739020Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:53.1739391Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:53.1739725Z NUMA node(s): 1 2025-05-07T20:23:53.1740027Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:53.1740476Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:53.1741024Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:53.1741538Z Vulnerability L1tf: Not affected 2025-05-07T20:23:53.1742034Z Vulnerability Mds: Not affected 2025-05-07T20:23:53.1742539Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:53.1743036Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:53.1743540Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:53.1744097Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:53.1744687Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:53.1745248Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:53.1745951Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:53.1746833Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:53.1747521Z Vulnerability Srbds: Not affected 2025-05-07T20:23:53.1747890Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:53.1748208Z 2025-05-07T20:23:53.1748311Z + cat /proc/cpuinfo 2025-05-07T20:23:53.1748449Z 2025-05-07T20:23:53.1748538Z processor : 0 2025-05-07T20:23:53.1748751Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1748994Z cpu family : 23 2025-05-07T20:23:53.1749203Z model : 49 
2025-05-07T20:23:53.1749404Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1749651Z stepping : 0 2025-05-07T20:23:53.1749859Z microcode : 0x830107f 2025-05-07T20:23:53.1750196Z cpu MHz : 2897.609 2025-05-07T20:23:53.1750417Z cache size : 512 KB 2025-05-07T20:23:53.1750630Z physical id : 0 2025-05-07T20:23:53.1750830Z siblings : 16 2025-05-07T20:23:53.1751033Z core id : 0 2025-05-07T20:23:53.1751232Z cpu cores : 8 2025-05-07T20:23:53.1751430Z apicid : 0 2025-05-07T20:23:53.1751630Z initial apicid : 0 2025-05-07T20:23:53.1751838Z fpu : yes 2025-05-07T20:23:53.1752037Z fpu_exception : yes 2025-05-07T20:23:53.1752255Z cpuid level : 13 2025-05-07T20:23:53.1752459Z wp : yes 2025-05-07T20:23:53.1754653Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1757060Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1757571Z bogomips : 5599.29 2025-05-07T20:23:53.1757797Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1758039Z clflush size : 64 2025-05-07T20:23:53.1758252Z cache_alignment : 64 2025-05-07T20:23:53.1758527Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1758864Z power management: 2025-05-07T20:23:53.1758996Z 2025-05-07T20:23:53.1759082Z processor : 1 2025-05-07T20:23:53.1759303Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1759544Z cpu family : 23 2025-05-07T20:23:53.1759749Z model : 49 2025-05-07T20:23:53.1759962Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1760216Z stepping : 0 2025-05-07T20:23:53.1760420Z microcode : 0x830107f 2025-05-07T20:23:53.1760675Z cpu MHz : 2154.332 2025-05-07T20:23:53.1760917Z cache size : 512 KB 2025-05-07T20:23:53.1761134Z physical id : 0 2025-05-07T20:23:53.1761345Z siblings : 16 2025-05-07T20:23:53.1761549Z core id : 1 2025-05-07T20:23:53.1761750Z cpu cores : 8 2025-05-07T20:23:53.1761954Z apicid : 2 2025-05-07T20:23:53.1762159Z initial apicid : 2 2025-05-07T20:23:53.1762367Z fpu : yes 2025-05-07T20:23:53.1762571Z fpu_exception : yes 2025-05-07T20:23:53.1762791Z cpuid level : 13 2025-05-07T20:23:53.1762997Z wp : yes 2025-05-07T20:23:53.1765099Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1767493Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1768000Z bogomips : 5599.29 2025-05-07T20:23:53.1768224Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1768460Z clflush size : 64 
2025-05-07T20:23:53.1815890Z cache_alignment : 64 2025-05-07T20:23:53.1816250Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1816590Z power management: 2025-05-07T20:23:53.1816729Z 2025-05-07T20:23:53.1816837Z processor : 2 2025-05-07T20:23:53.1817061Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1817367Z cpu family : 23 2025-05-07T20:23:53.1817624Z model : 49 2025-05-07T20:23:53.1817848Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1818145Z stepping : 0 2025-05-07T20:23:53.1818437Z microcode : 0x830107f 2025-05-07T20:23:53.1818671Z cpu MHz : 2375.804 2025-05-07T20:23:53.1818898Z cache size : 512 KB 2025-05-07T20:23:53.1819125Z physical id : 0 2025-05-07T20:23:53.1819507Z siblings : 16 2025-05-07T20:23:53.1819716Z core id : 2 2025-05-07T20:23:53.1819920Z cpu cores : 8 2025-05-07T20:23:53.1820118Z apicid : 4 2025-05-07T20:23:53.1820320Z initial apicid : 4 2025-05-07T20:23:53.1820536Z fpu : yes 2025-05-07T20:23:53.1820730Z fpu_exception : yes 2025-05-07T20:23:53.1820949Z cpuid level : 13 2025-05-07T20:23:53.1821156Z wp : yes 2025-05-07T20:23:53.1823373Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1826097Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1826677Z bogomips : 5599.29 2025-05-07T20:23:53.1826918Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1827172Z clflush size : 64 2025-05-07T20:23:53.1827400Z cache_alignment : 64 2025-05-07T20:23:53.1827700Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1828055Z power management: 2025-05-07T20:23:53.1828199Z 2025-05-07T20:23:53.1828286Z processor : 3 2025-05-07T20:23:53.1828514Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1828776Z cpu family : 23 2025-05-07T20:23:53.1828988Z model : 49 2025-05-07T20:23:53.1829205Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1829463Z stepping : 0 2025-05-07T20:23:53.1829677Z microcode : 0x830107f 2025-05-07T20:23:53.1829917Z cpu MHz : 3289.591 2025-05-07T20:23:53.1830141Z cache size : 512 KB 2025-05-07T20:23:53.1830362Z physical id : 0 2025-05-07T20:23:53.1830583Z siblings : 16 2025-05-07T20:23:53.1830792Z core id : 3 2025-05-07T20:23:53.1830998Z cpu cores : 8 2025-05-07T20:23:53.1831209Z apicid : 6 2025-05-07T20:23:53.1831422Z initial apicid : 6 2025-05-07T20:23:53.1831642Z fpu : yes 2025-05-07T20:23:53.1831849Z fpu_exception : yes 2025-05-07T20:23:53.1832077Z cpuid level : 13 2025-05-07T20:23:53.1832299Z wp : yes 2025-05-07T20:23:53.1834752Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1837130Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1837636Z bogomips : 5599.29 2025-05-07T20:23:53.1837856Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1838081Z clflush size : 64 2025-05-07T20:23:53.1838297Z cache_alignment : 64 2025-05-07T20:23:53.1838567Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1838882Z power management: 2025-05-07T20:23:53.1839017Z 2025-05-07T20:23:53.1839095Z processor : 4 2025-05-07T20:23:53.1839305Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1839541Z cpu family : 23 2025-05-07T20:23:53.1839735Z model : 49 2025-05-07T20:23:53.1839958Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1840234Z stepping : 0 2025-05-07T20:23:53.1840432Z microcode : 0x830107f 2025-05-07T20:23:53.1840653Z cpu MHz : 3299.649 2025-05-07T20:23:53.1840859Z cache size : 512 KB 2025-05-07T20:23:53.1841063Z physical id : 0 2025-05-07T20:23:53.1841273Z siblings : 16 2025-05-07T20:23:53.1841471Z core id : 4 2025-05-07T20:23:53.1841658Z cpu cores : 8 2025-05-07T20:23:53.1841854Z apicid : 8 2025-05-07T20:23:53.1842251Z initial apicid : 8 2025-05-07T20:23:53.1842457Z fpu : yes 2025-05-07T20:23:53.1842717Z fpu_exception : yes 2025-05-07T20:23:53.1842939Z cpuid level : 13 2025-05-07T20:23:53.1843133Z wp : yes 2025-05-07T20:23:53.1845348Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1847734Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1848239Z bogomips : 5599.29 2025-05-07T20:23:53.1848470Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1848703Z clflush size : 64 2025-05-07T20:23:53.1848924Z cache_alignment : 64 2025-05-07T20:23:53.1849193Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1849513Z power management: 2025-05-07T20:23:53.1849698Z 2025-05-07T20:23:53.1849810Z processor : 5 2025-05-07T20:23:53.1850075Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1850312Z cpu family : 23 2025-05-07T20:23:53.1850514Z model : 49 2025-05-07T20:23:53.1850723Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1850969Z stepping : 0 2025-05-07T20:23:53.1851183Z microcode : 0x830107f 2025-05-07T20:23:53.1851410Z cpu MHz : 3298.817 2025-05-07T20:23:53.1851631Z cache size : 512 KB 2025-05-07T20:23:53.1851847Z physical id : 0 2025-05-07T20:23:53.1852068Z siblings : 16 2025-05-07T20:23:53.1852269Z core id : 5 2025-05-07T20:23:53.1852469Z cpu cores : 8 2025-05-07T20:23:53.1852671Z apicid : 10 2025-05-07T20:23:53.1852874Z initial apicid : 10 2025-05-07T20:23:53.1853094Z fpu : yes 2025-05-07T20:23:53.1853304Z fpu_exception : yes 2025-05-07T20:23:53.1853518Z cpuid level : 13 2025-05-07T20:23:53.1853727Z wp : yes 2025-05-07T20:23:53.1855931Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1858333Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1858844Z bogomips : 5599.29 2025-05-07T20:23:53.1859065Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1859312Z clflush size : 64 2025-05-07T20:23:53.1859545Z cache_alignment : 64 2025-05-07T20:23:53.1859820Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1860150Z power management: 2025-05-07T20:23:53.1860282Z 2025-05-07T20:23:53.1860381Z processor : 6 2025-05-07T20:23:53.1860668Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1860988Z cpu family : 23 2025-05-07T20:23:53.1861237Z model : 49 2025-05-07T20:23:53.1861432Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1861674Z stepping : 0 2025-05-07T20:23:53.1861873Z microcode : 0x830107f 2025-05-07T20:23:53.1862097Z cpu MHz : 1940.260 2025-05-07T20:23:53.1862304Z cache size : 512 KB 2025-05-07T20:23:53.1862511Z physical id : 0 2025-05-07T20:23:53.1862707Z siblings : 16 2025-05-07T20:23:53.1862900Z core id : 6 2025-05-07T20:23:53.1863099Z cpu cores : 8 2025-05-07T20:23:53.1863294Z apicid : 12 2025-05-07T20:23:53.1863480Z initial apicid : 12 2025-05-07T20:23:53.1863681Z fpu : yes 2025-05-07T20:23:53.1863877Z fpu_exception : yes 2025-05-07T20:23:53.1864186Z cpuid level : 13 2025-05-07T20:23:53.1864388Z wp : yes 2025-05-07T20:23:53.1866564Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1869115Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1869615Z bogomips : 5599.29 2025-05-07T20:23:53.1869848Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1870132Z clflush size : 64 2025-05-07T20:23:53.1870344Z cache_alignment : 64 2025-05-07T20:23:53.1870624Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1870947Z power management: 2025-05-07T20:23:53.1871077Z 2025-05-07T20:23:53.1871167Z processor : 7 2025-05-07T20:23:53.1871429Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1871665Z cpu family : 23 2025-05-07T20:23:53.1871871Z model : 49 2025-05-07T20:23:53.1872066Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1872304Z stepping : 0 2025-05-07T20:23:53.1872514Z microcode : 0x830107f 2025-05-07T20:23:53.1872728Z cpu MHz : 2583.007 2025-05-07T20:23:53.1872947Z cache size : 512 KB 2025-05-07T20:23:53.1873161Z physical id : 0 2025-05-07T20:23:53.1873366Z siblings : 16 2025-05-07T20:23:53.1873580Z core id : 7 2025-05-07T20:23:53.1873786Z cpu cores : 8 2025-05-07T20:23:53.1873982Z apicid : 
14 2025-05-07T20:23:53.1874183Z initial apicid : 14 2025-05-07T20:23:53.1874401Z fpu : yes 2025-05-07T20:23:53.1874590Z fpu_exception : yes 2025-05-07T20:23:53.1874810Z cpuid level : 13 2025-05-07T20:23:53.1875018Z wp : yes 2025-05-07T20:23:53.1877118Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1879506Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1880013Z bogomips : 5599.29 2025-05-07T20:23:53.1880231Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1880469Z clflush size : 64 2025-05-07T20:23:53.1880678Z cache_alignment : 64 2025-05-07T20:23:53.1880942Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1881263Z power management: 2025-05-07T20:23:53.1881394Z 2025-05-07T20:23:53.1881473Z processor : 8 2025-05-07T20:23:53.1881695Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1881929Z cpu family : 23 2025-05-07T20:23:53.1882132Z model : 49 2025-05-07T20:23:53.1882344Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1882593Z stepping : 0 2025-05-07T20:23:53.1882802Z microcode : 0x830107f 2025-05-07T20:23:53.1883037Z cpu MHz : 1977.170 2025-05-07T20:23:53.1883253Z cache size : 512 KB 2025-05-07T20:23:53.1883466Z physical id : 0 2025-05-07T20:23:53.1883682Z siblings : 16 2025-05-07T20:23:53.1883888Z core id : 0 2025-05-07T20:23:53.1884096Z cpu cores : 8 2025-05-07T20:23:53.1884293Z apicid : 1 2025-05-07T20:23:53.1884500Z initial apicid : 1 2025-05-07T20:23:53.1884716Z fpu : yes 2025-05-07T20:23:53.1884913Z fpu_exception : yes 2025-05-07T20:23:53.1885128Z cpuid level : 13 2025-05-07T20:23:53.1885341Z wp : yes 2025-05-07T20:23:53.1887439Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1890069Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1890580Z bogomips : 5599.29 2025-05-07T20:23:53.1890805Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1891046Z clflush size : 64 2025-05-07T20:23:53.1891261Z cache_alignment : 64 2025-05-07T20:23:53.1891534Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1891856Z power management: 2025-05-07T20:23:53.1891989Z 2025-05-07T20:23:53.1892081Z processor : 9 2025-05-07T20:23:53.1892299Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1892539Z cpu family : 23 2025-05-07T20:23:53.1892745Z model : 49 2025-05-07T20:23:53.1892954Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1893199Z 
stepping : 0 2025-05-07T20:23:53.1893405Z microcode : 0x830107f 2025-05-07T20:23:53.1893634Z cpu MHz : 3026.540 2025-05-07T20:23:53.1893850Z cache size : 512 KB 2025-05-07T20:23:53.1894064Z physical id : 0 2025-05-07T20:23:53.1894277Z siblings : 16 2025-05-07T20:23:53.1894581Z core id : 1 2025-05-07T20:23:53.1894797Z cpu cores : 8 2025-05-07T20:23:53.1895009Z apicid : 3 2025-05-07T20:23:53.1895206Z initial apicid : 3 2025-05-07T20:23:53.1895413Z fpu : yes 2025-05-07T20:23:53.1895613Z fpu_exception : yes 2025-05-07T20:23:53.1895831Z cpuid level : 13 2025-05-07T20:23:53.1896039Z wp : yes 2025-05-07T20:23:53.1898139Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1900544Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1901055Z bogomips : 5599.29 2025-05-07T20:23:53.1901276Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1901509Z clflush size : 64 2025-05-07T20:23:53.1901726Z cache_alignment : 64 2025-05-07T20:23:53.1901999Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1902313Z power management: 2025-05-07T20:23:53.1902452Z 2025-05-07T20:23:53.1902538Z processor : 10 2025-05-07T20:23:53.1902755Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1902992Z cpu family : 23 2025-05-07T20:23:53.1903200Z model : 49 2025-05-07T20:23:53.1903407Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1903643Z stepping : 0 2025-05-07T20:23:53.1903856Z microcode : 0x830107f 2025-05-07T20:23:53.1904086Z cpu MHz : 1832.408 2025-05-07T20:23:53.1904293Z cache size : 512 KB 2025-05-07T20:23:53.1904511Z physical id : 0 2025-05-07T20:23:53.1904726Z siblings : 16 2025-05-07T20:23:53.1904930Z core id : 2 2025-05-07T20:23:53.1905127Z cpu cores : 8 2025-05-07T20:23:53.1905330Z apicid : 5 2025-05-07T20:23:53.1905539Z initial apicid : 5 2025-05-07T20:23:53.1905746Z fpu : yes 2025-05-07T20:23:53.1905947Z fpu_exception : yes 2025-05-07T20:23:53.1906166Z cpuid level : 13 2025-05-07T20:23:53.1906371Z wp : yes 2025-05-07T20:23:53.1908464Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1910935Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1911435Z bogomips : 5599.29 2025-05-07T20:23:53.1911738Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1911985Z clflush size : 64 2025-05-07T20:23:53.1912207Z cache_alignment : 64 2025-05-07T20:23:53.1912469Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:53.1912792Z power management: 2025-05-07T20:23:53.1912928Z 2025-05-07T20:23:53.1913011Z processor : 11 2025-05-07T20:23:53.1913226Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1913459Z cpu family : 23 2025-05-07T20:23:53.1913672Z model : 49 2025-05-07T20:23:53.1913877Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1914115Z stepping : 0 2025-05-07T20:23:53.1914324Z microcode : 0x830107f 2025-05-07T20:23:53.1914550Z cpu MHz : 3289.948 2025-05-07T20:23:53.1914761Z cache size : 512 KB 2025-05-07T20:23:53.1914977Z physical id : 0 2025-05-07T20:23:53.1915185Z siblings : 16 2025-05-07T20:23:53.1915382Z core id : 3 2025-05-07T20:23:53.1915589Z cpu cores : 8 2025-05-07T20:23:53.1915791Z apicid : 7 2025-05-07T20:23:53.1915985Z initial apicid : 7 2025-05-07T20:23:53.1916203Z fpu : yes 2025-05-07T20:23:53.1916406Z fpu_exception : yes 2025-05-07T20:23:53.1916622Z cpuid level : 13 2025-05-07T20:23:53.1916828Z wp : yes 2025-05-07T20:23:53.1918928Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1921320Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1921829Z bogomips : 5599.29 2025-05-07T20:23:53.1922048Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1922291Z clflush size : 64 2025-05-07T20:23:53.1922514Z cache_alignment : 64 2025-05-07T20:23:53.1922784Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1923113Z power management: 2025-05-07T20:23:53.1923248Z 2025-05-07T20:23:53.1923341Z processor : 12 2025-05-07T20:23:53.1923560Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1923808Z cpu family : 23 2025-05-07T20:23:53.1924022Z model : 49 2025-05-07T20:23:53.1924230Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1924486Z stepping : 0 2025-05-07T20:23:53.1924706Z microcode : 0x830107f 2025-05-07T20:23:53.1924933Z cpu MHz : 3300.128 2025-05-07T20:23:53.1925156Z cache size : 512 KB 2025-05-07T20:23:53.1925384Z physical id : 0 2025-05-07T20:23:53.1925870Z siblings : 16 2025-05-07T20:23:53.1926083Z core id : 4 2025-05-07T20:23:53.1926293Z cpu cores : 8 2025-05-07T20:23:53.1926499Z apicid : 9 2025-05-07T20:23:53.1926708Z initial apicid : 9 2025-05-07T20:23:53.1926929Z fpu : yes 2025-05-07T20:23:53.1927130Z fpu_exception : yes 2025-05-07T20:23:53.1927363Z cpuid level : 13 2025-05-07T20:23:53.1927581Z wp : yes 2025-05-07T20:23:53.1929687Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:53.1932294Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1932800Z bogomips : 5599.29 2025-05-07T20:23:53.1933031Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1933276Z clflush size : 64 2025-05-07T20:23:53.1933498Z cache_alignment : 64 2025-05-07T20:23:53.1933904Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1934240Z power management: 2025-05-07T20:23:53.1934378Z 2025-05-07T20:23:53.1934562Z processor : 13 2025-05-07T20:23:53.1934785Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1935022Z cpu family : 23 2025-05-07T20:23:53.1935227Z model : 49 2025-05-07T20:23:53.1935438Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1935687Z stepping : 0 2025-05-07T20:23:53.1935897Z microcode : 0x830107f 2025-05-07T20:23:53.1936134Z cpu MHz : 3300.583 2025-05-07T20:23:53.1936352Z cache size : 512 KB 2025-05-07T20:23:53.1936569Z physical id : 0 2025-05-07T20:23:53.1936773Z siblings : 16 2025-05-07T20:23:53.1936972Z core id : 5 2025-05-07T20:23:53.1937170Z cpu cores : 8 2025-05-07T20:23:53.1937365Z apicid : 11 2025-05-07T20:23:53.1937575Z initial apicid : 11 2025-05-07T20:23:53.1937788Z fpu : yes 2025-05-07T20:23:53.1937987Z fpu_exception : yes 2025-05-07T20:23:53.1938201Z cpuid level : 13 2025-05-07T20:23:53.1938406Z wp : yes 2025-05-07T20:23:53.1940508Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1942904Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1943409Z bogomips : 5599.29 2025-05-07T20:23:53.1943645Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1943879Z clflush size : 64 2025-05-07T20:23:53.1944099Z cache_alignment : 64 2025-05-07T20:23:53.1944371Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1944694Z power management: 2025-05-07T20:23:53.1944832Z 2025-05-07T20:23:53.1944919Z processor : 14 2025-05-07T20:23:53.1945143Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1945384Z cpu family : 23 2025-05-07T20:23:53.1945589Z model : 49 2025-05-07T20:23:53.1945795Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1946044Z stepping : 0 2025-05-07T20:23:53.1946252Z microcode : 0x830107f 2025-05-07T20:23:53.1946486Z cpu MHz : 1789.141 2025-05-07T20:23:53.1946707Z cache size : 512 KB 2025-05-07T20:23:53.1946929Z physical id : 0 2025-05-07T20:23:53.1947140Z siblings : 16 2025-05-07T20:23:53.1947346Z core id : 6 2025-05-07T20:23:53.1947546Z cpu cores : 8 2025-05-07T20:23:53.1947751Z apicid : 13 2025-05-07T20:23:53.1947960Z initial apicid : 13 2025-05-07T20:23:53.1948170Z fpu : yes 2025-05-07T20:23:53.1948379Z fpu_exception : yes 2025-05-07T20:23:53.1948597Z cpuid level : 13 2025-05-07T20:23:53.1948802Z wp : yes 2025-05-07T20:23:53.1950918Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1955053Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1955570Z bogomips : 5599.29 2025-05-07T20:23:53.1955801Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1956034Z clflush size : 64 2025-05-07T20:23:53.1956256Z cache_alignment : 64 2025-05-07T20:23:53.1956530Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1956851Z power management: 2025-05-07T20:23:53.1956991Z 2025-05-07T20:23:53.1957191Z processor : 15 2025-05-07T20:23:53.1957406Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1957639Z cpu family : 23 2025-05-07T20:23:53.1957843Z model : 49 2025-05-07T20:23:53.1958044Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1958276Z stepping : 0 2025-05-07T20:23:53.1958477Z microcode : 0x830107f 2025-05-07T20:23:53.1958697Z cpu MHz : 1758.835 2025-05-07T20:23:53.1958908Z cache size : 512 KB 2025-05-07T20:23:53.1959126Z physical id : 0 2025-05-07T20:23:53.1959349Z siblings : 16 2025-05-07T20:23:53.1959542Z core id : 7 2025-05-07T20:23:53.1959745Z cpu cores : 8 2025-05-07T20:23:53.1959979Z apicid : 15 2025-05-07T20:23:53.1960208Z initial apicid : 15 2025-05-07T20:23:53.1960433Z fpu : yes 2025-05-07T20:23:53.1960645Z fpu_exception : yes 2025-05-07T20:23:53.1960856Z cpuid level : 13 2025-05-07T20:23:53.1961071Z wp : yes 2025-05-07T20:23:53.1963181Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1973149Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1973661Z bogomips : 5599.29 2025-05-07T20:23:53.1973881Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1974129Z clflush size : 64 2025-05-07T20:23:53.1974352Z cache_alignment : 64 2025-05-07T20:23:53.1974694Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1975021Z power management: 2025-05-07T20:23:53.1975156Z 2025-05-07T20:23:53.1975161Z 2025-05-07T20:23:53.1975295Z ################################################################################ 2025-05-07T20:23:53.1975611Z [INFO] Print PCI info ... 2025-05-07T20:23:53.1975856Z + lspci -v 2025-05-07T20:23:53.1975970Z 2025-05-07T20:23:53.1976193Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:53.1976597Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:53.1976935Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:53.1977146Z 2025-05-07T20:23:53.1977355Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:53.1977753Z Physical Slot: 1 2025-05-07T20:23:53.1978003Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1978209Z 2025-05-07T20:23:53.1978461Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:53.1978913Z Physical Slot: 1 2025-05-07T20:23:53.1979171Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:53.1979401Z 2025-05-07T20:23:53.1979683Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:53.1980138Z Physical Slot: 3 2025-05-07T20:23:53.1980405Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1980785Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.1981143Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:53.1981383Z 2025-05-07T20:23:53.1981694Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.1982353Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:53.1982655Z Physical Slot: 4 2025-05-07T20:23:53.1982918Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:53.1983313Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1983687Z Capabilities: 2025-05-07T20:23:53.1983971Z Kernel driver in use: nvme 2025-05-07T20:23:53.1984143Z 2025-05-07T20:23:53.1984461Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.1984961Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.1985320Z Physical Slot: 5 2025-05-07T20:23:53.1985568Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1985938Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1986336Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.1986668Z Capabilities: 2025-05-07T20:23:53.1986943Z Kernel driver in use: ena 2025-05-07T20:23:53.1987195Z Kernel modules: ena 2025-05-07T20:23:53.1987337Z 2025-05-07T20:23:53.1987509Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:53.1987900Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:53.1988207Z Physical Slot: 30 2025-05-07T20:23:53.1988466Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:53.1988864Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:53.1989275Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:53.1989667Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:53.1990008Z Capabilities: 2025-05-07T20:23:53.1990288Z Kernel driver in use: nvidia 2025-05-07T20:23:53.1990593Z Kernel modules: nvidia 2025-05-07T20:23:53.1990748Z 2025-05-07T20:23:53.1991063Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.1991606Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:53.1991907Z Physical Slot: 31 2025-05-07T20:23:53.1992149Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1992515Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1992923Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:53.1993270Z Capabilities: 2025-05-07T20:23:53.1993539Z Kernel driver in use: nvme 2025-05-07T20:23:53.1993712Z 2025-05-07T20:23:53.1993716Z 2025-05-07T20:23:53.1993837Z ################################################################################ 2025-05-07T20:23:53.1994180Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:53.1994472Z + uname -a 2025-05-07T20:23:53.1994602Z 2025-05-07T20:23:53.1995031Z Linux ip-10-0-35-243.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:53.1995562Z 2025-05-07T20:23:53.1995653Z + uname -m 2025-05-07T20:23:53.1995769Z 2025-05-07T20:23:53.1995850Z x86_64 2025-05-07T20:23:53.1995960Z 2025-05-07T20:23:53.1996048Z + cat /proc/version 2025-05-07T20:23:53.1996194Z 2025-05-07T20:23:53.1996768Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:53.1997436Z 2025-05-07T20:23:53.1997527Z + cat /etc/os-release 2025-05-07T20:23:53.1997676Z 2025-05-07T20:23:53.1997779Z NAME="Amazon Linux" 2025-05-07T20:23:53.1997992Z VERSION="2023" 2025-05-07T20:23:53.1998200Z ID="amzn" 2025-05-07T20:23:53.1998461Z ID_LIKE="fedora" 2025-05-07T20:23:53.1998672Z VERSION_ID="2023" 2025-05-07T20:23:53.1998912Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:53.1999197Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:53.1999494Z ANSI_COLOR="0;33" 2025-05-07T20:23:53.1999750Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:53.2000257Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:53.2000703Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:53.2001139Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:53.2001601Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:53.2001987Z VENDOR_NAME="AWS" 2025-05-07T20:23:53.2002236Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:53.2002538Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:53.2002700Z 2025-05-07T20:23:53.2002908Z ################################################################################ 2025-05-07T20:23:53.2003224Z # Print EC2 Instance Info 2025-05-07T20:23:53.2003473Z # 2025-05-07T20:23:53.2003692Z # [2025-05-07T20:23:53.193Z] + print_ec2_info 2025-05-07T20:23:53.2004008Z ################################################################################ 2025-05-07T20:23:53.2004236Z 2025-05-07T20:23:53.2066783Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:53.2189774Z instance-id: i-02a13dec7b575dc8f 2025-05-07T20:23:53.2303158Z instance-type: g5.4xlarge 2025-05-07T20:23:53.2340515Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:53.2340894Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:53.2351356Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:53.2351726Z env: 2025-05-07T20:23:53.2351961Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:53.2352293Z BUILD_ENV: build_binary 2025-05-07T20:23:53.2352560Z BUILD_TARGET: genai 2025-05-07T20:23:53.2352804Z BUILD_VARIANT: cuda 2025-05-07T20:23:53.2353061Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:53.2353336Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:53.2353647Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.2354009Z ##[endgroup] 2025-05-07T20:23:53.5724299Z ################################################################################ 2025-05-07T20:23:53.5724804Z [INFO] Printing general display info ... 2025-05-07T20:23:53.5757207Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:53.6811986Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:53.6821090Z /usr/bin/sudo 2025-05-07T20:23:53.6832781Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:53.6843210Z /usr/bin/yum 2025-05-07T20:23:53.6844906Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:53.6866067Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:54.1547520Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:54.2244383Z ================================================================================ 2025-05-07T20:23:54.2244751Z WARNING: 2025-05-07T20:23:54.2244989Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:54.2245233Z 2025-05-07T20:23:54.2245324Z Available Versions: 2025-05-07T20:23:54.2245468Z 2025-05-07T20:23:54.2245562Z Version 2023.7.20250331: 2025-05-07T20:23:54.2245867Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:54.2246163Z 2025-05-07T20:23:54.2246297Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:54.2246517Z 2025-05-07T20:23:54.2246604Z Release notes: 2025-05-07T20:23:54.2247025Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:54.2247411Z 2025-05-07T20:23:54.2247502Z Version 2023.7.20250414: 2025-05-07T20:23:54.2247811Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:54.2248063Z 2025-05-07T20:23:54.2248187Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:54.2248402Z 2025-05-07T20:23:54.2248492Z Release notes: 2025-05-07T20:23:54.2248885Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:54.2249265Z 2025-05-07T20:23:54.2249350Z Version 2023.7.20250428: 2025-05-07T20:23:54.2249656Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:54.2250169Z 2025-05-07T20:23:54.2250311Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:54.2250562Z 2025-05-07T20:23:54.2250654Z Release notes: 2025-05-07T20:23:54.2251062Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:54.2251439Z 2025-05-07T20:23:54.2251565Z ================================================================================ 2025-05-07T20:23:54.3423473Z Dependencies resolved. 
2025-05-07T20:23:54.3712810Z ================================================================================ 2025-05-07T20:23:54.3713237Z Package Arch Version Repository Size 2025-05-07T20:23:54.3713638Z ================================================================================ 2025-05-07T20:23:54.3713960Z Upgrading: 2025-05-07T20:23:54.3714320Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:54.3714932Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:54.3715330Z 2025-05-07T20:23:54.3717181Z Transaction Summary 2025-05-07T20:23:54.3717462Z ================================================================================ 2025-05-07T20:23:54.3717777Z Upgrade 2 Packages 2025-05-07T20:23:54.3717925Z 2025-05-07T20:23:54.3718030Z Total download size: 6.9 M 2025-05-07T20:23:54.3718299Z Downloading Packages: 2025-05-07T20:23:54.4063221Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 37 MB/s | 1.2 MB 00:00 2025-05-07T20:23:54.4648018Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 62 MB/s | 5.7 MB 00:00 2025-05-07T20:23:54.4655624Z -------------------------------------------------------------------------------- 2025-05-07T20:23:54.4658505Z Total 74 MB/s | 6.9 MB 00:00 2025-05-07T20:23:54.4660814Z Running transaction check 2025-05-07T20:23:54.4756208Z Transaction check succeeded. 2025-05-07T20:23:54.4756737Z Running transaction test 2025-05-07T20:23:54.5051197Z Transaction test succeeded. 2025-05-07T20:23:54.5054447Z Running transaction 2025-05-07T20:23:55.0567340Z Preparing : 1/1 2025-05-07T20:23:55.1621608Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:55.1645224Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:55.1841760Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:55.1842653Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:55.1951050Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:55.1971981Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:55.3421618Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:55.3422244Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:55.3422848Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:55.3423399Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:55.4803798Z ================================================================================ 2025-05-07T20:23:55.4804173Z WARNING: 2025-05-07T20:23:55.4804423Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:55.4804658Z 2025-05-07T20:23:55.4804763Z Available Versions: 2025-05-07T20:23:55.4804913Z 2025-05-07T20:23:55.4805004Z Version 2023.7.20250331: 2025-05-07T20:23:55.4805325Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:55.4805584Z 2025-05-07T20:23:55.4805715Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:55.4805929Z 2025-05-07T20:23:55.4806015Z Release notes: 2025-05-07T20:23:55.4806436Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:55.4807092Z 2025-05-07T20:23:55.4807198Z Version 2023.7.20250414: 2025-05-07T20:23:55.4807516Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:55.4807773Z 2025-05-07T20:23:55.4807894Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:55.4808117Z 2025-05-07T20:23:55.4808203Z Release notes: 2025-05-07T20:23:55.4808613Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:55.4808995Z 2025-05-07T20:23:55.4809085Z Version 2023.7.20250428: 2025-05-07T20:23:55.4809405Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:55.4809669Z 2025-05-07T20:23:55.4809785Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:55.4810001Z 2025-05-07T20:23:55.4810095Z Release notes: 2025-05-07T20:23:55.4810498Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:55.4810889Z 2025-05-07T20:23:55.4811201Z ================================================================================ 2025-05-07T20:23:55.5376535Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:55.5376880Z 2025-05-07T20:23:55.5376977Z Upgraded: 2025-05-07T20:23:55.5377317Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:55.5377900Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:55.5378248Z 2025-05-07T20:23:55.5378338Z Complete! 2025-05-07T20:23:55.5843223Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:55.5866806Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:56.0437326Z Last metadata expiration check: 0:00:10 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:56.0678033Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:56.1091611Z Dependencies resolved. 
2025-05-07T20:23:56.1268961Z ================================================================================ 2025-05-07T20:23:56.1269433Z Package Architecture Version Repository Size 2025-05-07T20:23:56.1269863Z ================================================================================ 2025-05-07T20:23:56.1270171Z Installing: 2025-05-07T20:23:56.1270474Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:56.1270765Z 2025-05-07T20:23:56.1270864Z Transaction Summary 2025-05-07T20:23:56.1271134Z ================================================================================ 2025-05-07T20:23:56.1271444Z Install 1 Package 2025-05-07T20:23:56.1271581Z 2025-05-07T20:23:56.1271702Z Total download size: 319 k 2025-05-07T20:23:56.1272310Z Installed size: 837 k 2025-05-07T20:23:56.1273733Z Downloading Packages: 2025-05-07T20:23:56.2043522Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.6 MB/s | 319 kB 00:00 2025-05-07T20:23:56.2049045Z -------------------------------------------------------------------------------- 2025-05-07T20:23:56.2051857Z Total 4.0 MB/s | 319 kB 00:00 2025-05-07T20:23:56.2210505Z Running transaction check 2025-05-07T20:23:56.2264554Z Transaction check succeeded. 2025-05-07T20:23:56.2264958Z Running transaction test 2025-05-07T20:23:56.2723862Z Transaction test succeeded. 2025-05-07T20:23:56.2728196Z Running transaction 2025-05-07T20:23:56.3766730Z Preparing : 1/1 2025-05-07T20:23:56.4277955Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.6154024Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.7549288Z ================================================================================ 2025-05-07T20:23:56.7549713Z WARNING: 2025-05-07T20:23:56.7549961Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:56.7550532Z 2025-05-07T20:23:56.7550633Z Available Versions: 2025-05-07T20:23:56.7550801Z 2025-05-07T20:23:56.7550903Z Version 2023.7.20250331: 2025-05-07T20:23:56.7551268Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:56.7551539Z 2025-05-07T20:23:56.7551664Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:56.7551890Z 2025-05-07T20:23:56.7551978Z Release notes: 2025-05-07T20:23:56.7552401Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:56.7552790Z 2025-05-07T20:23:56.7552878Z Version 2023.7.20250414: 2025-05-07T20:23:56.7553192Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:56.7553445Z 2025-05-07T20:23:56.7553569Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:56.7553783Z 2025-05-07T20:23:56.7553874Z Release notes: 2025-05-07T20:23:56.7554268Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:56.7554660Z 2025-05-07T20:23:56.7554909Z Version 2023.7.20250428: 2025-05-07T20:23:56.7555229Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:56.7555483Z 2025-05-07T20:23:56.7555600Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:56.7555823Z 2025-05-07T20:23:56.7555914Z Release notes: 2025-05-07T20:23:56.7556323Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:56.7556700Z 2025-05-07T20:23:56.7556819Z ================================================================================ 2025-05-07T20:23:56.7895790Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.7896139Z 2025-05-07T20:23:56.7896230Z Installed: 2025-05-07T20:23:56.7896551Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:56.7896851Z 2025-05-07T20:23:56.7896943Z Complete! 2025-05-07T20:23:56.8376743Z + hostname 2025-05-07T20:23:56.8376890Z 2025-05-07T20:23:56.8390354Z ip-10-0-35-243.ec2.internal 2025-05-07T20:23:56.8391356Z 2025-05-07T20:23:56.8391989Z + sudo lshw -C display 2025-05-07T20:23:56.8392138Z 2025-05-07T20:23:57.2437613Z *-display:0 UNCLAIMED 2025-05-07T20:23:57.2437932Z description: VGA compatible controller 2025-05-07T20:23:57.2438275Z product: Amazon.com, Inc. 2025-05-07T20:23:57.2438559Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:57.2438827Z physical id: 3
2025-05-07T20:23:57.2439075Z bus info: pci@0000:00:03.0
2025-05-07T20:23:57.2439342Z version: 00
2025-05-07T20:23:57.2439565Z width: 32 bits
2025-05-07T20:23:57.2447923Z clock: 33MHz
2025-05-07T20:23:57.2448225Z capabilities: vga_controller bus_master
2025-05-07T20:23:57.2448567Z configuration: latency=0
2025-05-07T20:23:57.2448910Z resources: memory:c1000000-c13fffff memory:c0000-dffff
2025-05-07T20:23:57.2449254Z *-display:1
2025-05-07T20:23:57.2449518Z description: 3D controller
2025-05-07T20:23:57.2449823Z product: GA102GL [A10G]
2025-05-07T20:23:57.2450095Z vendor: NVIDIA Corporation
2025-05-07T20:23:57.2450374Z physical id: 1e
2025-05-07T20:23:57.2450621Z bus info: pci@0000:00:1e.0
2025-05-07T20:23:57.2450880Z version: a1
2025-05-07T20:23:57.2451102Z width: 64 bits
2025-05-07T20:23:57.2451332Z clock: 33MHz
2025-05-07T20:23:57.2451623Z capabilities: pm pciexpress msix bus_master cap_list
2025-05-07T20:23:57.2452013Z configuration: driver=nvidia latency=0
2025-05-07T20:23:57.2452665Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff
2025-05-07T20:23:57.2475689Z
2025-05-07T20:23:57.2475954Z ################################################################################
2025-05-07T20:23:57.2476318Z [INFO] Printing NVIDIA GPU info ...
2025-05-07T20:23:57.2612517Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:57.2781176Z Wed May 7 20:23:57 2025
2025-05-07T20:23:57.2781595Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:57.2782127Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:57.2782625Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:57.2783136Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:57.2783689Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:57.2784136Z | | | MIG M. |
2025-05-07T20:23:57.2784490Z |=========================================+========================+======================|
2025-05-07T20:23:57.2864493Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:57.2865186Z | 0% 31C P0 60W / 300W | 0MiB / 23028MiB | 0% Default |
2025-05-07T20:23:57.2865582Z | | | N/A |
2025-05-07T20:23:57.2865990Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:57.2866401Z
2025-05-07T20:23:57.2866813Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:57.2867248Z | Processes: |
2025-05-07T20:23:57.2867709Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:57.2868145Z | ID ID Usage |
2025-05-07T20:23:57.2868520Z |=========================================================================================|
2025-05-07T20:23:57.2869653Z | No running processes found |
2025-05-07T20:23:57.2870145Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:57.4307908Z ################################################################################
2025-05-07T20:23:57.4308262Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:57.4307908Z ################################################################################
2025-05-07T20:23:57.4308262Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:57.4450597Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:57.4451323Z [CHECK] rocminfo not found
2025-05-07T20:23:57.4461190Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:57.4462186Z [CHECK] rocm-smi not found
2025-05-07T20:23:57.4499291Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:57.4499749Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:57.4511643Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:57.4512066Z env:
2025-05-07T20:23:57.4512375Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:57.4512694Z BUILD_ENV: build_binary
2025-05-07T20:23:57.4512951Z BUILD_TARGET: genai
2025-05-07T20:23:57.4513184Z BUILD_VARIANT: cuda
2025-05-07T20:23:57.4513417Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:57.4513685Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:57.4513991Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:57.4514331Z ##[endgroup]
2025-05-07T20:23:57.7889432Z ################################################################################
2025-05-07T20:23:57.7889809Z # Setup Miniconda
2025-05-07T20:23:57.7890033Z #
2025-05-07T20:23:57.7906265Z # [2025-05-07T20:23:57.790Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:57.7906676Z ################################################################################
2025-05-07T20:23:57.7921533Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:57.8810753Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:57.8811144Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:57.8829573Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:57.8852372Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:58.6429154Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:58.6429574Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:58.6575855Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:59.1090795Z Unpacking payload ...
2025-05-07T20:23:59.6259102Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:00.4275005Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:02.5521709Z Installing base environment...
2025-05-07T20:24:03.6317761Z Preparing transaction: ...working... done
2025-05-07T20:24:06.6632137Z Executing transaction: ...working... done
2025-05-07T20:24:07.3545178Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:07.4468349Z installation finished.
2025-05-07T20:24:07.4475890Z + rm -f miniconda.sh
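The installer flags on the `bash miniconda.sh -b -p ... -u` line are Miniconda's batch mode (`-b`, no prompts), install prefix (`-p`), and update-in-place (`-u`). Each `[EXEC] [ATTEMPT n/3]` line comes from a retry wrapper in the prelude; a sketch of the pattern (the name exec_with_retries and the backoff are illustrative, the real helper lives in .github/scripts/setup_env.bash):

    # Sketch: run a command up to 3 times, echoing each attempt like the log above.
    exec_with_retries () {
      local max=3 attempt
      for attempt in $(seq 0 $((max - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep $((2 ** attempt))    # simple exponential backoff between attempts
      done
      return 1
    }
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null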
2025-05-07T20:24:07.4784579Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:24:07.4784952Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:07.8468699Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:07.8469218Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:07.8469735Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:07.8470206Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:07.8470677Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:07.8471142Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:07.8471595Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:07.8472048Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:07.8472524Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:07.8473604Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:07.8474146Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:07.8474545Z modified /home/ec2-user/.bashrc
2025-05-07T20:24:07.8474952Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:24:07.9210637Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:08.7739130Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:08.7762210Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:22.2591405Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:23.8994694Z Solving environment: done
2025-05-07T20:24:23.9955363Z ## Package Plan ##
2025-05-07T20:24:23.9955769Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:23.9956143Z added / updated specs:
2025-05-07T20:24:23.9956411Z - conda-libmamba-solver
2025-05-07T20:24:23.9956658Z - libarchive
2025-05-07T20:24:23.9956878Z - libmamba
2025-05-07T20:24:23.9957087Z - libmambapy
2025-05-07T20:24:23.9957357Z The following packages will be downloaded:
2025-05-07T20:24:23.9957687Z package                    | build
2025-05-07T20:24:23.9958015Z ---------------------------|-----------------
2025-05-07T20:24:23.9958438Z ca-certificates-2025.4.26  | hbd8a1cb_0      149 KB  conda-forge
2025-05-07T20:24:23.9958912Z certifi-2025.4.26          | pyhd8ed1ab_0    154 KB  conda-forge
2025-05-07T20:24:23.9959346Z conda-25.3.1               | py313h78bf25f_1 1.1 MB  conda-forge
2025-05-07T20:24:23.9959862Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0    41 KB  conda-forge
2025-05-07T20:24:23.9960424Z ------------------------------------------------------------
2025-05-07T20:24:23.9960764Z Total: 1.4 MB
2025-05-07T20:24:23.9961094Z The following packages will be UPDATED:
2025-05-07T20:24:23.9964983Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:23.9965795Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:23.9966556Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:23.9967215Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:23.9968042Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:23.9968704Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:24.1090795Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:24:24.1311744Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:24:24.2020893Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:24:24.2026774Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:24:24.2028841Z done
2025-05-07T20:24:24.3033970Z Preparing transaction: done
2025-05-07T20:24:24.4039961Z Verifying transaction: done
2025-05-07T20:24:25.8061488Z Executing transaction: done
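The solver packages are installed with `--solver=classic`, presumably so the transaction does not depend on the very solver it is replacing, and `--override-channels` keeps pkgs/main out of the resolution. A quick way to confirm which solver ends up active (a sketch; `conda config --show solver` assumes a conda new enough to have the solver setting, as this one is):

    # Sketch: verify which dependency solver conda will use from now on.
    conda config --show solver     # expected: solver: libmamba
    conda info | grep -i solver    # the conda info dump below shows "solver : libmamba (default)"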
2025-05-07T20:24:27.6603394Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:24:27.6630112Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:28.6032156Z Channels:
2025-05-07T20:24:28.6032407Z - defaults
2025-05-07T20:24:28.6032627Z Platform: linux-64
2025-05-07T20:24:29.8626682Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.4804144Z Solving environment: done
2025-05-07T20:24:30.6287939Z ## Package Plan ##
2025-05-07T20:24:30.6288466Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:30.6289098Z added / updated specs:
2025-05-07T20:24:30.6289543Z - conda
2025-05-07T20:24:30.6289979Z The following packages will be downloaded:
2025-05-07T20:24:30.6290542Z package                    | build
2025-05-07T20:24:30.6291160Z ---------------------------|-----------------
2025-05-07T20:24:30.6291525Z pip-25.1                   | pyhc872135_2    1.3 MB
2025-05-07T20:24:30.6291913Z tzdata-2025b               | h04d1e81_0      116 KB
2025-05-07T20:24:30.6292324Z ------------------------------------------------------------
2025-05-07T20:24:30.6292673Z Total: 1.4 MB
2025-05-07T20:24:30.6293012Z The following packages will be UPDATED:
2025-05-07T20:24:30.6293528Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:30.6294057Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:30.6294487Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:30.9275245Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:24:30.9280091Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:24:30.9281303Z done
2025-05-07T20:24:31.0286921Z Preparing transaction: done
2025-05-07T20:24:31.1294151Z Verifying transaction: done
2025-05-07T20:24:33.2320850Z Executing transaction: done
2025-05-07T20:24:33.8550662Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:33.8554712Z + conda clean --packages --tarball -y
2025-05-07T20:24:34.8952057Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:34.8952449Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:34.9638684Z + conda clean --all -y
2025-05-07T20:24:35.5061360Z There are no unused tarball(s) to remove.
2025-05-07T20:24:35.5061822Z Will remove 1 index cache(s).
2025-05-07T20:24:35.5062252Z There are no unused package(s) to remove.
2025-05-07T20:24:35.5062579Z There are no tempfile(s) to remove.
2025-05-07T20:24:35.5062874Z There are no logfile(s) to remove.
2025-05-07T20:24:35.5743726Z + conda info
2025-05-07T20:24:36.3512512Z active environment : base
2025-05-07T20:24:36.3513289Z active env location : /home/ec2-user/miniconda
2025-05-07T20:24:36.3513945Z shell level : 1
2025-05-07T20:24:36.3514532Z user config file : /home/ec2-user/.condarc
2025-05-07T20:24:36.3515316Z populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:36.3516041Z conda version : 25.3.1
2025-05-07T20:24:36.3516600Z conda-build version : not installed
2025-05-07T20:24:36.3517198Z python version : 3.13.2.final.0
2025-05-07T20:24:36.3517789Z solver : libmamba (default)
2025-05-07T20:24:36.3518404Z virtual packages : __archspec=1=zen2
2025-05-07T20:24:36.3518999Z __conda=25.3.1=0
2025-05-07T20:24:36.3519558Z __cuda=12.8=0
2025-05-07T20:24:36.3520104Z __glibc=2.34=0
2025-05-07T20:24:36.3520661Z __linux=6.1.130=0
2025-05-07T20:24:36.3521711Z __unix=0=0
2025-05-07T20:24:36.3522215Z base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:24:36.3522654Z conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:36.3523011Z conda av metadata url : None
2025-05-07T20:24:36.3523378Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:36.3523821Z https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:36.3524213Z https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:36.3524599Z https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:36.3524966Z package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:36.3525309Z /home/ec2-user/.conda/pkgs
2025-05-07T20:24:36.3525919Z envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:36.3526264Z /home/ec2-user/.conda/envs
2025-05-07T20:24:36.3526584Z platform : linux-64
2025-05-07T20:24:36.3527438Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:36.3528486Z UID:GID : 1000:1000
2025-05-07T20:24:36.3528764Z netrc file : None
2025-05-07T20:24:36.3529035Z offline mode : False
2025-05-07T20:24:36.4212274Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:36.4213033Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_4041281f-d965-4d6b-b629-f6367ffc8ef1 ...
2025-05-07T20:24:36.4213843Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
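The add_path_* runner file command used here is GitHub Actions' mechanism for persisting PATH entries across steps. From inside a step the same effect is achieved by appending to $GITHUB_PATH (a real Actions interface; the directory below is this job's Miniconda prefix):

    # Sketch: make the Miniconda binaries visible to all subsequent job steps.
    echo "${HOME}/miniconda/bin" >> "${GITHUB_PATH}"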
2025-05-07T20:24:36.4303028Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.12
2025-05-07T20:24:36.4303526Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.12
2025-05-07T20:24:36.4323152Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:36.4323517Z env:
2025-05-07T20:24:36.4323772Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:36.4324080Z BUILD_ENV: build_binary
2025-05-07T20:24:36.4324339Z BUILD_TARGET: genai
2025-05-07T20:24:36.4324577Z BUILD_VARIANT: cuda
2025-05-07T20:24:36.4324810Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:36.4325075Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:36.4325715Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:36.4326129Z ##[endgroup]
2025-05-07T20:24:36.7755616Z ################################################################################
2025-05-07T20:24:36.7756009Z # Create Conda Environment
2025-05-07T20:24:36.7756264Z #
2025-05-07T20:24:36.7771710Z # [2025-05-07T20:24:36.776Z] + create_conda_environment build_binary 3.12
2025-05-07T20:24:36.7772189Z ################################################################################
2025-05-07T20:24:36.7788694Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:37.0230155Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:37.0230555Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:37.0230902Z + conda info --envs
2025-05-07T20:24:37.7782925Z # conda environments:
2025-05-07T20:24:37.7783213Z #
2025-05-07T20:24:37.7783447Z base /home/ec2-user/miniconda
2025-05-07T20:24:37.8489640Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:39.4956041Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
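Deleting the prefix with `rm -rf` before `conda create` makes environment creation idempotent on a reused self-hosted runner. A compact sketch of the same recreate-from-scratch pattern (env name and Python version taken from this job's settings):

    # Sketch: recreate the build environment from scratch on every run.
    rm -rf "${HOME}/miniconda/envs/build_binary"   # tolerates a missing env
    conda create -y -n build_binary python=3.12

(`conda remove -n build_binary --all -y` would be the conda-native equivalent of the deletion.)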
2025-05-07T20:24:39.4982730Z [SETUP] Creating new Conda environment (Python 3.12) ...
2025-05-07T20:24:39.5003755Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.12
2025-05-07T20:24:40.2586772Z Channels:
2025-05-07T20:24:40.2587019Z - defaults
2025-05-07T20:24:40.2587231Z Platform: linux-64
2025-05-07T20:24:41.8329746Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:41.9334818Z Solving environment: done
2025-05-07T20:24:41.9625821Z ## Package Plan ##
2025-05-07T20:24:41.9626393Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:41.9626959Z added / updated specs:
2025-05-07T20:24:41.9627240Z - python=3.12
2025-05-07T20:24:41.9627496Z The following packages will be downloaded:
2025-05-07T20:24:41.9627868Z package                    | build
2025-05-07T20:24:41.9628194Z ---------------------------|-----------------
2025-05-07T20:24:41.9628563Z _libgcc_mutex-0.1          | main            3 KB
2025-05-07T20:24:41.9628965Z _openmp_mutex-5.1          | 1_gnu           21 KB
2025-05-07T20:24:41.9629957Z ca-certificates-2025.2.25  | h06a4308_0      129 KB
2025-05-07T20:24:41.9630532Z python-3.12.9              | h5148396_0      34.7 MB
2025-05-07T20:24:41.9631052Z setuptools-78.1.1          | py312h06a4308_0 2.2 MB
2025-05-07T20:24:41.9631460Z wheel-0.45.1               | py312h06a4308_0 147 KB
2025-05-07T20:24:41.9631834Z ------------------------------------------------------------
2025-05-07T20:24:41.9632169Z Total: 37.2 MB
2025-05-07T20:24:41.9632518Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:41.9633196Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:41.9633651Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:41.9634068Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:41.9634554Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:41.9635055Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:41.9635509Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:41.9635966Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:41.9636410Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:41.9636926Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:41.9637591Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:41.9638233Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:41.9638665Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:41.9639091Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:41.9639493Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:41.9639901Z python pkgs/main/linux-64::python-3.12.9-h5148396_0
2025-05-07T20:24:41.9640333Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:41.9640815Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py312h06a4308_0
2025-05-07T20:24:41.9641281Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:41.9641670Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:41.9642052Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:41.9642472Z wheel pkgs/main/linux-64::wheel-0.45.1-py312h06a4308_0
2025-05-07T20:24:41.9642868Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:41.9643239Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:41.9643637Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:42.0970775Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:42.1299249Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:42.1327539Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:42.1690864Z wheel-0.45.1 | 147 KB | ########## | 100%
2025-05-07T20:24:42.4755425Z setuptools-78.1.1 | 2.2 MB | ########## | 100%
2025-05-07T20:24:43.1003065Z python-3.12.9 | 34.7 MB | ########## | 100%
2025-05-07T20:24:43.1006087Z done
2025-05-07T20:24:43.3112463Z Preparing transaction: done
2025-05-07T20:24:44.7395979Z Verifying transaction: done
2025-05-07T20:24:47.1577547Z Executing transaction: done
2025-05-07T20:24:47.2084864Z #
2025-05-07T20:24:47.2085530Z # To activate this environment, use
2025-05-07T20:24:47.2085842Z #
2025-05-07T20:24:47.2086045Z # $ conda activate build_binary
2025-05-07T20:24:47.2086317Z #
2025-05-07T20:24:47.2086536Z # To deactivate an active environment, use
2025-05-07T20:24:47.2086834Z #
2025-05-07T20:24:47.2087026Z # $ conda deactivate
2025-05-07T20:24:47.3252597Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:47.3277184Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:50.3485173Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (25.1)
2025-05-07T20:24:50.3485847Z Collecting pip
2025-05-07T20:24:50.3486166Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:50.3486599Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:50.3489509Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 90.7 MB/s eta 0:00:00
2025-05-07T20:24:50.3489885Z Installing collected packages: pip
2025-05-07T20:24:50.3490239Z Attempting uninstall: pip
2025-05-07T20:24:50.3490533Z Found existing installation: pip 25.1
2025-05-07T20:24:50.3490843Z Uninstalling pip-25.1:
2025-05-07T20:24:50.3491133Z Successfully uninstalled pip-25.1
2025-05-07T20:24:50.3491447Z Successfully installed pip-25.1.1
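`conda run -n build_binary` executes a command inside the named environment without activating it in the calling shell, which is why pip here upgrades the env's own pip rather than the base one. An equivalent invocation that leaves no ambiguity about which interpreter's pip runs (a sketch; `python -m pip` is the standard guard against stale pip entry points):

    # Sketch: upgrade pip inside the target env, pinning the interpreter explicitly.
    conda run -n build_binary python -m pip install --upgrade pip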
2025-05-07T20:24:50.4173026Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:50.4197177Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:51.2771802Z Channels:
2025-05-07T20:24:51.2772045Z - conda-forge
2025-05-07T20:24:51.2772280Z Platform: linux-64
2025-05-07T20:25:01.8836377Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:03.6035726Z Solving environment: done
2025-05-07T20:25:03.6656548Z ## Package Plan ##
2025-05-07T20:25:03.6657049Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:03.6657605Z added / updated specs:
2025-05-07T20:25:03.6657982Z - pyopenssl[version='>22.1.0']
2025-05-07T20:25:03.6658327Z The following packages will be downloaded:
2025-05-07T20:25:03.6658672Z package                    | build
2025-05-07T20:25:03.6659054Z ---------------------------|-----------------
2025-05-07T20:25:03.6659597Z cffi-1.17.1                | py312h06ac9bb_0 288 KB  conda-forge
2025-05-07T20:25:03.6660293Z cryptography-44.0.3        | py312hda17c39_0 1.5 MB  conda-forge
2025-05-07T20:25:03.6660827Z expat-2.7.0                | h5888daf_0      137 KB  conda-forge
2025-05-07T20:25:03.6661238Z libexpat-2.7.0             | h5888daf_0      73 KB   conda-forge
2025-05-07T20:25:03.6661674Z libgcc-15.1.0              | h767d61c_2      810 KB  conda-forge
2025-05-07T20:25:03.6662097Z libgcc-ng-15.1.0           | h69a702a_2      34 KB   conda-forge
2025-05-07T20:25:03.6662519Z libgomp-15.1.0             | h767d61c_2      442 KB  conda-forge
2025-05-07T20:25:03.6662942Z libnsl-2.0.1               | hd590300_0      33 KB   conda-forge
2025-05-07T20:25:03.6663692Z libsqlite-3.46.0           | hde9e2c9_0      845 KB  conda-forge
2025-05-07T20:25:03.6664126Z libuuid-2.38.1             | h0b41bf4_0      33 KB   conda-forge
2025-05-07T20:25:03.6664551Z libxcrypt-4.4.36           | hd590300_1      98 KB   conda-forge
2025-05-07T20:25:03.6664984Z libzlib-1.2.13             | h4ab18f5_6      60 KB   conda-forge
2025-05-07T20:25:03.6665406Z openssl-3.5.0              | h7b32b05_1      3.0 MB  conda-forge
2025-05-07T20:25:03.6665980Z pycparser-2.22             | pyh29332c3_1    108 KB  conda-forge
2025-05-07T20:25:03.6666670Z pyopenssl-25.0.0           | pyhd8ed1ab_0    120 KB  conda-forge
2025-05-07T20:25:03.6667131Z python-3.12.2              |hab00c5b_0_cpython 30.8 MB conda-forge
2025-05-07T20:25:03.6667585Z python_abi-3.12            | 7_cp312         7 KB    conda-forge
2025-05-07T20:25:03.6668053Z typing-extensions-4.13.2   | h0e9735f_0      88 KB   conda-forge
2025-05-07T20:25:03.6668553Z typing_extensions-4.13.2   | pyh29332c3_0    51 KB   conda-forge
2025-05-07T20:25:03.6669003Z zlib-1.2.13                | h4ab18f5_6      91 KB   conda-forge
2025-05-07T20:25:03.6669396Z ------------------------------------------------------------
2025-05-07T20:25:03.6669743Z Total: 38.6 MB
2025-05-07T20:25:03.6670096Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:03.6670531Z cffi conda-forge/linux-64::cffi-1.17.1-py312h06ac9bb_0
2025-05-07T20:25:03.6671049Z cryptography conda-forge/linux-64::cryptography-44.0.3-py312hda17c39_0
2025-05-07T20:25:03.6671572Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:25:03.6672024Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:25:03.6672465Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:25:03.6675566Z libsqlite conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
2025-05-07T20:25:03.6676115Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:03.6676796Z libzlib conda-forge/linux-64::libzlib-1.2.13-h4ab18f5_6
2025-05-07T20:25:03.6677312Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:25:03.6677798Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:25:03.6678292Z python_abi conda-forge/noarch::python_abi-3.12-7_cp312
2025-05-07T20:25:03.6678821Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:25:03.6679428Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:25:03.6679923Z The following packages will be UPDATED:
2025-05-07T20:25:03.6680653Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:25:03.6681453Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:25:03.6682134Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:25:03.6682797Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:03.6683455Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:25:03.6684174Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.2.13-h4ab18f5_6
2025-05-07T20:25:03.6684765Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:03.6685474Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
2025-05-07T20:25:03.6686115Z python pkgs/main::python-3.12.9-h5148396_0 --> conda-forge::python-3.12.2-hab00c5b_0_cpython
2025-05-07T20:25:03.6686674Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:03.9589522Z libsqlite-3.46.0 | 845 KB | ########## | 100%
2025-05-07T20:25:04.0326674Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:04.0591222Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:25:04.0653099Z libuuid-2.38.1 | 33 KB | ########## | 100%
2025-05-07T20:25:04.0788507Z libnsl-2.0.1 | 33 KB | ########## | 100%
2025-05-07T20:25:04.1677432Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:25:04.1708743Z expat-2.7.0 | 137 KB | ########## | 100%
2025-05-07T20:25:04.2443670Z cffi-1.17.1 | 288 KB | ########## | 100%
2025-05-07T20:25:04.2676934Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:25:04.4975784Z python-3.12.2 | 30.8 MB | ########5 | 86%
2025-05-07T20:25:04.5203585Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:25:04.5441596Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:25:04.5645077Z zlib-1.2.13 | 91 KB | ########## | 100%
2025-05-07T20:25:04.5785411Z libxcrypt-4.4.36 | 98 KB | ########## | 100%
2025-05-07T20:25:04.5861669Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:25:04.5967408Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:25:04.6149896Z libexpat-2.7.0 | 73 KB | ########## | 100%
2025-05-07T20:25:04.6318090Z libzlib-1.2.13 | 60 KB | ########## | 100%
2025-05-07T20:25:04.6327584Z typing_extensions-4. | 51 KB | ########## | 100%
| 51 KB | ########## | 100%  2025-05-07T20:25:04.6564426Z 2025-05-07T20:25:04.6564430Z 2025-05-07T20:25:04.6564434Z 2025-05-07T20:25:04.6564438Z 2025-05-07T20:25:04.6564441Z 2025-05-07T20:25:04.6564445Z 2025-05-07T20:25:04.6564448Z 2025-05-07T20:25:04.6564457Z 2025-05-07T20:25:04.6564460Z 2025-05-07T20:25:04.6564464Z 2025-05-07T20:25:04.6564468Z 2025-05-07T20:25:04.6564471Z 2025-05-07T20:25:04.6564475Z 2025-05-07T20:25:04.6564486Z 2025-05-07T20:25:04.6564490Z 2025-05-07T20:25:04.6564493Z 2025-05-07T20:25:04.6564497Z 2025-05-07T20:25:04.6574098Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.6574516Z 2025-05-07T20:25:04.6574519Z 2025-05-07T20:25:04.6574523Z 2025-05-07T20:25:04.6574526Z 2025-05-07T20:25:04.6574530Z 2025-05-07T20:25:04.6574533Z 2025-05-07T20:25:04.6574537Z 2025-05-07T20:25:04.6574540Z 2025-05-07T20:25:04.6574543Z 2025-05-07T20:25:04.6574554Z 2025-05-07T20:25:04.6574558Z 2025-05-07T20:25:04.6574561Z 2025-05-07T20:25:04.6574564Z 2025-05-07T20:25:04.6574568Z 2025-05-07T20:25:04.6574571Z 2025-05-07T20:25:04.6574575Z 2025-05-07T20:25:04.6574578Z 2025-05-07T20:25:04.6679722Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.6680192Z 2025-05-07T20:25:04.6680196Z 2025-05-07T20:25:04.6680199Z 2025-05-07T20:25:04.6680203Z 2025-05-07T20:25:04.6680206Z 2025-05-07T20:25:04.6680210Z 2025-05-07T20:25:04.6680213Z 2025-05-07T20:25:04.6680216Z 2025-05-07T20:25:04.6680220Z 2025-05-07T20:25:04.6680233Z 2025-05-07T20:25:04.6680237Z 2025-05-07T20:25:04.6680240Z 2025-05-07T20:25:04.6680244Z 2025-05-07T20:25:04.6680247Z 2025-05-07T20:25:04.6680251Z 2025-05-07T20:25:04.6682322Z 2025-05-07T20:25:04.6687870Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:25:04.6688324Z 2025-05-07T20:25:04.6688336Z 2025-05-07T20:25:04.6688340Z 2025-05-07T20:25:04.6688343Z 2025-05-07T20:25:04.6688347Z 2025-05-07T20:25:04.6688350Z 2025-05-07T20:25:04.6688353Z 2025-05-07T20:25:04.6688357Z 2025-05-07T20:25:04.6688360Z 2025-05-07T20:25:04.6688364Z 2025-05-07T20:25:04.6688367Z 2025-05-07T20:25:04.6688370Z 2025-05-07T20:25:04.6688545Z 2025-05-07T20:25:04.6688548Z 2025-05-07T20:25:04.6688551Z 2025-05-07T20:25:04.6688555Z 2025-05-07T20:25:04.6688558Z 2025-05-07T20:25:04.6688562Z 2025-05-07T20:25:04.6691677Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.6692113Z 2025-05-07T20:25:04.6692118Z 2025-05-07T20:25:04.6692123Z 2025-05-07T20:25:04.6692128Z 2025-05-07T20:25:04.6692132Z 2025-05-07T20:25:04.6692148Z 2025-05-07T20:25:04.6692152Z 2025-05-07T20:25:04.6692155Z 2025-05-07T20:25:04.6692159Z 2025-05-07T20:25:04.6692162Z 2025-05-07T20:25:04.6692166Z 2025-05-07T20:25:04.6692169Z 2025-05-07T20:25:04.6692173Z 2025-05-07T20:25:04.6692339Z 2025-05-07T20:25:04.6692343Z 2025-05-07T20:25:04.6692374Z 2025-05-07T20:25:04.6695581Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:25:04.6696010Z 2025-05-07T20:25:04.6696014Z 2025-05-07T20:25:04.6696017Z 2025-05-07T20:25:04.6696029Z 2025-05-07T20:25:04.6696033Z 2025-05-07T20:25:04.6696036Z 2025-05-07T20:25:04.6696040Z 2025-05-07T20:25:04.6696043Z 2025-05-07T20:25:04.6696047Z 2025-05-07T20:25:04.6696050Z 2025-05-07T20:25:04.6696054Z 2025-05-07T20:25:04.6696057Z 2025-05-07T20:25:04.6696061Z 2025-05-07T20:25:04.6696065Z 2025-05-07T20:25:04.6696068Z 2025-05-07T20:25:04.6696082Z 2025-05-07T20:25:04.6696085Z 2025-05-07T20:25:04.6696089Z 2025-05-07T20:25:04.6740057Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.6740468Z 2025-05-07T20:25:04.6740474Z 2025-05-07T20:25:04.6740479Z 2025-05-07T20:25:04.6740484Z 
2025-05-07T20:25:04.6740500Z 2025-05-07T20:25:04.6740506Z 2025-05-07T20:25:04.6740511Z 2025-05-07T20:25:04.6740517Z 2025-05-07T20:25:04.6740522Z 2025-05-07T20:25:04.6740527Z 2025-05-07T20:25:04.6740532Z 2025-05-07T20:25:04.6740537Z 2025-05-07T20:25:04.6740542Z 2025-05-07T20:25:04.6740547Z 2025-05-07T20:25:04.6740559Z 2025-05-07T20:25:04.6740564Z 2025-05-07T20:25:04.6740570Z 2025-05-07T20:25:04.6740575Z 2025-05-07T20:25:04.6740580Z 2025-05-07T20:25:04.6873266Z ... (more hidden) ... 2025-05-07T20:25:05.3777071Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:25:05.3785129Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:25:05.3785637Z 2025-05-07T20:25:05.3785646Z 2025-05-07T20:25:05.3785653Z 2025-05-07T20:25:05.3785661Z 2025-05-07T20:25:05.3785668Z 2025-05-07T20:25:05.3785675Z 2025-05-07T20:25:05.3785682Z 2025-05-07T20:25:05.3785689Z 2025-05-07T20:25:05.3785714Z 2025-05-07T20:25:05.3785721Z 2025-05-07T20:25:05.3785764Z 2025-05-07T20:25:05.3785772Z 2025-05-07T20:25:05.3785779Z 2025-05-07T20:25:05.3785786Z 2025-05-07T20:25:05.3785794Z 2025-05-07T20:25:05.3785802Z 2025-05-07T20:25:05.3785809Z 2025-05-07T20:25:05.3785816Z 2025-05-07T20:25:05.3785823Z 2025-05-07T20:25:05.3785992Z 2025-05-07T20:25:05.3786867Z  2025-05-07T20:25:05.3787528Z 2025-05-07T20:25:05.3787933Z 2025-05-07T20:25:05.3788263Z  2025-05-07T20:25:05.3788692Z 2025-05-07T20:25:05.3788700Z 2025-05-07T20:25:05.3789028Z  2025-05-07T20:25:05.3789346Z 2025-05-07T20:25:05.3789350Z 2025-05-07T20:25:05.3789354Z 2025-05-07T20:25:05.3789533Z  2025-05-07T20:25:05.3789750Z 2025-05-07T20:25:05.3789763Z 2025-05-07T20:25:05.3789767Z 2025-05-07T20:25:05.3789777Z 2025-05-07T20:25:05.3789956Z  2025-05-07T20:25:05.3790174Z 2025-05-07T20:25:05.3790178Z 2025-05-07T20:25:05.3790181Z 2025-05-07T20:25:05.3790184Z 2025-05-07T20:25:05.3790200Z 2025-05-07T20:25:05.3790378Z  2025-05-07T20:25:05.3790851Z 2025-05-07T20:25:05.3790898Z 2025-05-07T20:25:05.3790902Z 2025-05-07T20:25:05.3790906Z 2025-05-07T20:25:05.3790909Z 2025-05-07T20:25:05.3790913Z 2025-05-07T20:25:05.3791104Z  2025-05-07T20:25:05.3791340Z 2025-05-07T20:25:05.3791344Z 2025-05-07T20:25:05.3791347Z 2025-05-07T20:25:05.3791351Z 2025-05-07T20:25:05.3791354Z 2025-05-07T20:25:05.3791357Z 2025-05-07T20:25:05.3791361Z 2025-05-07T20:25:05.3791546Z  2025-05-07T20:25:05.3791787Z 2025-05-07T20:25:05.3791790Z 2025-05-07T20:25:05.3791958Z 2025-05-07T20:25:05.3791962Z 2025-05-07T20:25:05.3791966Z 2025-05-07T20:25:05.3791969Z 2025-05-07T20:25:05.3791972Z 2025-05-07T20:25:05.3791976Z 2025-05-07T20:25:05.3792171Z  2025-05-07T20:25:05.3792409Z 2025-05-07T20:25:05.3792422Z 2025-05-07T20:25:05.3792426Z 2025-05-07T20:25:05.3792429Z 2025-05-07T20:25:05.3792433Z 2025-05-07T20:25:05.3792436Z 2025-05-07T20:25:05.3792440Z 2025-05-07T20:25:05.3792443Z 2025-05-07T20:25:05.3792446Z 2025-05-07T20:25:05.3792641Z  2025-05-07T20:25:05.3792874Z 2025-05-07T20:25:05.3792878Z 2025-05-07T20:25:05.3792881Z 2025-05-07T20:25:05.3792885Z 2025-05-07T20:25:05.3792896Z 2025-05-07T20:25:05.3792899Z 2025-05-07T20:25:05.3792903Z 2025-05-07T20:25:05.3792906Z 2025-05-07T20:25:05.3792910Z 2025-05-07T20:25:05.3792913Z 2025-05-07T20:25:05.3793112Z  2025-05-07T20:25:05.3793353Z 2025-05-07T20:25:05.3793356Z 2025-05-07T20:25:05.3793360Z 2025-05-07T20:25:05.3793363Z 2025-05-07T20:25:05.3793367Z 2025-05-07T20:25:05.3793370Z 2025-05-07T20:25:05.3793374Z 2025-05-07T20:25:05.3793377Z 2025-05-07T20:25:05.3793385Z 2025-05-07T20:25:05.3793388Z 2025-05-07T20:25:05.3793392Z 2025-05-07T20:25:05.3793592Z  2025-05-07T20:25:05.3793834Z 
2025-05-07T20:25:05.3793838Z 2025-05-07T20:25:05.3793842Z 2025-05-07T20:25:05.3793845Z 2025-05-07T20:25:05.3793849Z 2025-05-07T20:25:05.3793852Z 2025-05-07T20:25:05.3793855Z 2025-05-07T20:25:05.3793859Z 2025-05-07T20:25:05.3793862Z 2025-05-07T20:25:05.3793866Z 2025-05-07T20:25:05.3793869Z 2025-05-07T20:25:05.3793873Z 2025-05-07T20:25:05.3794077Z  2025-05-07T20:25:05.3794316Z 2025-05-07T20:25:05.3794325Z 2025-05-07T20:25:05.3794328Z 2025-05-07T20:25:05.3794331Z 2025-05-07T20:25:05.3794335Z 2025-05-07T20:25:05.3794338Z 2025-05-07T20:25:05.3794361Z 2025-05-07T20:25:05.3794374Z 2025-05-07T20:25:05.3794377Z 2025-05-07T20:25:05.3794381Z 2025-05-07T20:25:05.3794384Z 2025-05-07T20:25:05.3794388Z 2025-05-07T20:25:05.3794396Z 2025-05-07T20:25:05.3794596Z  2025-05-07T20:25:05.3794828Z 2025-05-07T20:25:05.3794840Z 2025-05-07T20:25:05.3794843Z 2025-05-07T20:25:05.3794847Z 2025-05-07T20:25:05.3794850Z 2025-05-07T20:25:05.3794854Z 2025-05-07T20:25:05.3794857Z 2025-05-07T20:25:05.3794861Z 2025-05-07T20:25:05.3794864Z 2025-05-07T20:25:05.3794868Z 2025-05-07T20:25:05.3794871Z 2025-05-07T20:25:05.3794875Z 2025-05-07T20:25:05.3794878Z 2025-05-07T20:25:05.3794882Z 2025-05-07T20:25:05.3795088Z  2025-05-07T20:25:05.3795340Z 2025-05-07T20:25:05.3795344Z 2025-05-07T20:25:05.3795347Z 2025-05-07T20:25:05.3795351Z 2025-05-07T20:25:05.3795354Z 2025-05-07T20:25:05.3795357Z 2025-05-07T20:25:05.3795361Z 2025-05-07T20:25:05.3795364Z 2025-05-07T20:25:05.3795368Z 2025-05-07T20:25:05.3795371Z 2025-05-07T20:25:05.3795460Z 2025-05-07T20:25:05.3795464Z 2025-05-07T20:25:05.3795467Z 2025-05-07T20:25:05.3795471Z 2025-05-07T20:25:05.3795474Z 2025-05-07T20:25:05.3795693Z  2025-05-07T20:25:05.3795933Z 2025-05-07T20:25:05.3795937Z 2025-05-07T20:25:05.3795940Z 2025-05-07T20:25:05.3795944Z 2025-05-07T20:25:05.3795947Z 2025-05-07T20:25:05.3795951Z 2025-05-07T20:25:05.3795954Z 2025-05-07T20:25:05.3795958Z 2025-05-07T20:25:05.3795969Z 2025-05-07T20:25:05.3795972Z 2025-05-07T20:25:05.3795976Z 2025-05-07T20:25:05.3795979Z 2025-05-07T20:25:05.3795983Z 2025-05-07T20:25:05.3795986Z 2025-05-07T20:25:05.3795989Z 2025-05-07T20:25:05.3796076Z 2025-05-07T20:25:05.3796287Z  2025-05-07T20:25:05.3796538Z 2025-05-07T20:25:05.3796541Z 2025-05-07T20:25:05.3796544Z 2025-05-07T20:25:05.3796548Z 2025-05-07T20:25:05.3796551Z 2025-05-07T20:25:05.3796562Z 2025-05-07T20:25:05.3796565Z 2025-05-07T20:25:05.3796569Z 2025-05-07T20:25:05.3796572Z 2025-05-07T20:25:05.3796576Z 2025-05-07T20:25:05.3796579Z 2025-05-07T20:25:05.3796583Z 2025-05-07T20:25:05.3796586Z 2025-05-07T20:25:05.3796590Z 2025-05-07T20:25:05.3796593Z 2025-05-07T20:25:05.3796597Z 2025-05-07T20:25:05.3796600Z 2025-05-07T20:25:05.3796830Z  2025-05-07T20:25:05.3797073Z 2025-05-07T20:25:05.3797077Z 2025-05-07T20:25:05.3797080Z 2025-05-07T20:25:05.3797083Z 2025-05-07T20:25:05.3797087Z 2025-05-07T20:25:05.3797090Z 2025-05-07T20:25:05.3797094Z 2025-05-07T20:25:05.3797103Z 2025-05-07T20:25:05.3797107Z 2025-05-07T20:25:05.3797110Z 2025-05-07T20:25:05.3797114Z 2025-05-07T20:25:05.3797127Z 2025-05-07T20:25:05.3797131Z 2025-05-07T20:25:05.3797134Z 2025-05-07T20:25:05.3797137Z 2025-05-07T20:25:05.3797141Z 2025-05-07T20:25:05.3797144Z 2025-05-07T20:25:05.3797148Z 2025-05-07T20:25:05.3797376Z  2025-05-07T20:25:05.3797636Z 2025-05-07T20:25:05.3797718Z done 2025-05-07T20:25:05.4794709Z Preparing transaction: \ done 2025-05-07T20:25:06.2344450Z Verifying transaction: / - \ | / - done 2025-05-07T20:25:07.9400342Z Executing transaction: | / - \ | / - \ | / - \ | / - \ | done 2025-05-07T20:25:08.2996878Z [SETUP] Testing 
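The exact command behind this import test is not echoed in the log; a minimal equivalent check against the build_binary environment (a sketch, not the script's actual helper) would be:

    # Verify the pyOpenSSL package installed above is importable in the env
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"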
2025-05-07T20:25:10.0596901Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:25:10.0610518Z [SETUP] Installing libxcrypt ...
2025-05-07T20:25:10.0635696Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:25:10.9275511Z Channels:
2025-05-07T20:25:10.9275924Z  - conda-forge
2025-05-07T20:25:10.9276228Z Platform: linux-64
2025-05-07T20:25:14.3680634Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:14.7385468Z Solving environment: done
2025-05-07T20:25:14.7750833Z # All requested packages already installed.
2025-05-07T20:25:18.1871424Z [SETUP] Copying over ...
2025-05-07T20:25:18.1872212Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.12/crypt.h
2025-05-07T20:25:19.8421284Z [SETUP] Installed Python version: Python 3.12.2
2025-05-07T20:25:19.8421748Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:25:19.8456382Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:19.8456868Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:19.8470513Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:19.8470870Z env:
2025-05-07T20:25:19.8471090Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:19.8471622Z BUILD_ENV: build_binary
2025-05-07T20:25:19.8471883Z BUILD_TARGET: genai
2025-05-07T20:25:19.8472118Z BUILD_VARIANT: cuda
2025-05-07T20:25:19.8472369Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:19.8472634Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:19.8472952Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:19.8473331Z ##[endgroup]
2025-05-07T20:25:20.1864782Z ################################################################################
2025-05-07T20:25:20.1865251Z # Install C/C++ Compilers
2025-05-07T20:25:20.1865520Z #
2025-05-07T20:25:20.1880840Z # [2025-05-07T20:25:20.187Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:20.1881276Z ################################################################################
2025-05-07T20:25:20.1898115Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:20.2770330Z [CHECK] Network does not appear to be blocked.
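The "[EXEC] [ATTEMPT 0/3]" prefix on network-bound commands indicates a retry wrapper in .github/scripts/setup_env.bash. A minimal sketch of that pattern, with the attempt counter and limit taken from the log and the back-off delay assumed:

    exec_with_retries () {
      # Retry a command up to max_retries times, echoing each attempt in the
      # same "[EXEC] [ATTEMPT n/3]" style seen above; the real helper may differ.
      local max_retries=3
      local attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        "$@" && return 0
        sleep 2  # assumed pause between attempts; not shown in the log
      done
      return 1
    }

    # Example mirroring the network check above:
    #   exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null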
2025-05-07T20:25:20.2780148Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:20.2800708Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:21.1508293Z Channels:
2025-05-07T20:25:21.1508566Z  - conda-forge
2025-05-07T20:25:21.1508812Z Platform: linux-64
2025-05-07T20:25:24.5883301Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:24.9598218Z Solving environment: done
2025-05-07T20:25:25.0234160Z ## Package Plan ##
2025-05-07T20:25:25.0234541Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:25.0234979Z   added / updated specs:
2025-05-07T20:25:25.0235273Z     - sysroot_linux-64=2.17
2025-05-07T20:25:25.0235592Z The following packages will be downloaded:
2025-05-07T20:25:25.0235953Z     package                        |            build
2025-05-07T20:25:25.0236278Z     ---------------------------|-----------------
2025-05-07T20:25:25.0236717Z     kernel-headers_linux-64-3.10.0 | he073ed8_18      921 KB  conda-forge
2025-05-07T20:25:25.0237221Z     sysroot_linux-64-2.17          | h0157908_18     14.5 MB  conda-forge
2025-05-07T20:25:25.0237645Z     ------------------------------------------------------------
2025-05-07T20:25:25.0237995Z                                                Total:  15.4 MB
2025-05-07T20:25:25.0238350Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:25.0238880Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:25.0239475Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:25.0239951Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:25.3674746Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:25:26.0839536Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:26.0846008Z done
2025-05-07T20:25:26.1850350Z Preparing transaction: done
2025-05-07T20:25:26.3858152Z Verifying transaction: done
2025-05-07T20:25:26.5907753Z Executing transaction: done
2025-05-07T20:25:26.7480893Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:26.7481281Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:28.4425969Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
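Pinning sysroot_linux-64=2.17 caps the GLIBC symbol versions the toolchain can link against. One way to confirm the cap held on a built artifact (the .so name below is illustrative, not taken from this log):

    # Symbol versions the artifact requires from GLIBC; with the 2.17 sysroot
    # these should stay at or below GLIBC_2.17:
    objdump -T fbgemm_gpu_py.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu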
2025-05-07T20:25:28.4439908Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:28.4461248Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:29.3339658Z Channels:
2025-05-07T20:25:29.3339894Z  - conda-forge
2025-05-07T20:25:29.3340136Z Platform: linux-64
2025-05-07T20:25:32.7714400Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:33.7516291Z Solving environment: done
2025-05-07T20:25:33.8178963Z ## Package Plan ##
2025-05-07T20:25:33.8179335Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:33.8179787Z   added / updated specs:
2025-05-07T20:25:33.8180075Z     - gxx_linux-64=11.4.0
2025-05-07T20:25:33.8180377Z The following packages will be downloaded:
2025-05-07T20:25:33.8180717Z     package                        |            build
2025-05-07T20:25:33.8181052Z     ---------------------------|-----------------
2025-05-07T20:25:33.8181487Z     binutils_impl_linux-64-2.40    | ha1999f0_7        6.0 MB  conda-forge
2025-05-07T20:25:33.8181975Z     binutils_linux-64-2.40         | hb3c18ed_4         28 KB  conda-forge
2025-05-07T20:25:33.8182449Z     gcc_impl_linux-64-11.4.0       | h00c12a0_13      53.0 MB  conda-forge
2025-05-07T20:25:33.8182897Z     gcc_linux-64-11.4.0            | ha077dfb_4         31 KB  conda-forge
2025-05-07T20:25:33.8183347Z     gxx_impl_linux-64-11.4.0       | h634f3ee_13      11.2 MB  conda-forge
2025-05-07T20:25:33.8183788Z     gxx_linux-64-11.4.0            | h35bfe5d_4         29 KB  conda-forge
2025-05-07T20:25:33.8184237Z     ld_impl_linux-64-2.40          | hf3520f5_7        691 KB  conda-forge
2025-05-07T20:25:33.8184719Z     libgcc-devel_linux-64-11.4.0   | h8f596e0_113      2.3 MB  conda-forge
2025-05-07T20:25:33.8185209Z     libsanitizer-11.4.0            | h5763a12_13       3.5 MB  conda-forge
2025-05-07T20:25:33.8185655Z     libstdcxx-15.1.0               | h8f9b012_2        3.7 MB  conda-forge
2025-05-07T20:25:33.8186146Z     libstdcxx-devel_linux-64-11.4.0| h8f596e0_113     11.1 MB  conda-forge
2025-05-07T20:25:33.8186647Z     libstdcxx-ng-15.1.0            | h4852527_2         34 KB  conda-forge
2025-05-07T20:25:33.8187107Z     ------------------------------------------------------------
2025-05-07T20:25:33.8187462Z                                                Total:  91.6 MB
2025-05-07T20:25:33.8187814Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:33.8188581Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:33.8189165Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:33.8189736Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:33.8190274Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:33.8190954Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:33.8191476Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:33.8192026Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:33.8192611Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:33.8193137Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:33.8193698Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:33.8194209Z The following packages will be UPDATED:
2025-05-07T20:25:33.8194750Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:33.8195493Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:33.8196089Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:34.3805027Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:34.6597867Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:34.7579812Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:34.8097006Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:34.8422456Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:34.8604370Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:34.8758744Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:34.8801999Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:34.8876897Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:34.9034685Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:34.9895516Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:35.7550340Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100%
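The plan above updates libstdcxx-ng to 15.1.0 from conda-forge while the compilers themselves stay at GCC 11.4.0. A quick spot-check (not part of the original script) of which GLIBCXX ABI levels the environment's libstdc++ runtime provides, using the path from the [CHECK] line earlier:

    strings /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6 | grep -E '^GLIBCXX_[0-9.]+$' | sort -Vu | tail -5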
2025-05-07T20:25:36.3765183Z done
2025-05-07T20:25:36.4765772Z Preparing transaction: done
2025-05-07T20:25:36.9772679Z Verifying transaction: done
2025-05-07T20:25:37.0783043Z Executing transaction: done
2025-05-07T20:25:37.2535924Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:41.2002539Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:41.2035585Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:41.2064629Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:41.2095084Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:43.1052204Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:43.1702138Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:45.0730530Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:45.1400140Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:47.0548728Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:47.1191187Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:49.0174508Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:49.0800001Z [CHECK] Binary g++ found in PATH
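With the symlinks in place, cc, gcc, c++, and g++ inside the environment all resolve to the conda cross-toolchain binaries. A hedged spot-check, assuming the build_binary env from above (not a command the script itself runs):

    for tool in cc gcc c++ g++; do
      # Print where each alias resolves and the compiler version it reports
      conda run -n build_binary which "${tool}"
      conda run -n build_binary "${tool}" --version | head -1
    done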
2025-05-07T20:25:49.0804781Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:49.0805323Z + conda run -n build_binary cc -dM -E -
#define __DBL_MIN_EXP__ (-1021)
#define __UINT_LEAST16_MAX__ 0xffff
#define __ATOMIC_ACQUIRE 2
#define __FLT128_MAX_10_EXP__ 4932
#define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
#define __GCC_IEC_559_COMPLEX 2
#define __UINT_LEAST8_TYPE__ unsigned char
#define __SIZEOF_FLOAT80__ 16
#define __INTMAX_C(c) c ## L
#define __CHAR_BIT__ 8
#define __UINT8_MAX__ 0xff
#define __SCHAR_WIDTH__ 8
#define __WINT_MAX__ 0xffffffffU
#define __FLT32_MIN_EXP__ (-125)
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __SIZE_MAX__ 0xffffffffffffffffUL
#define __WCHAR_MAX__ 0x7fffffff
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1
#define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L)
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1
#define __GCC_ATOMIC_CHAR_LOCK_FREE 2
#define __GCC_IEC_559 2
#define __FLT32X_DECIMAL_DIG__ 17
#define __FLT_EVAL_METHOD__ 0
#define __FLT64_DECIMAL_DIG__ 17
#define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2
#define __UINT_FAST64_MAX__ 0xffffffffffffffffUL
#define __SIG_ATOMIC_TYPE__ int
#define __DBL_MIN_10_EXP__ (-307)
#define __FINITE_MATH_ONLY__ 0
#define __FLT32X_MAX_EXP__ 1024
#define __FLT32_HAS_DENORM__ 1
#define __UINT_FAST8_MAX__ 0xff
#define __FLT32_MAX_10_EXP__ 38
#define __DEC64_MAX_EXP__ 385
#define __INT8_C(c) c
#define __INT_LEAST8_WIDTH__ 8
#define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL
#define __SHRT_MAX__ 0x7fff
#define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L
#define __FLT64X_MAX_10_EXP__ 4932
#define __LDBL_IS_IEC_60559__ 2
#define __FLT64X_HAS_QUIET_NAN__ 1
#define __UINT_LEAST8_MAX__ 0xff
#define __GCC_ATOMIC_BOOL_LOCK_FREE 2
#define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128
#define __UINTMAX_TYPE__ long unsigned int
#define __linux 1
#define __DEC32_EPSILON__ 1E-6DF
#define __FLT_EVAL_METHOD_TS_18661_3__ 0
#define __unix 1
#define __UINT32_MAX__ 0xffffffffU
#define __FLT128_MIN_EXP__ (-16381)
#define __WINT_MIN__ 0U
#define __FLT128_MIN_10_EXP__ (-4931)
#define __FLT32X_IS_IEC_60559__ 2
#define __INT_LEAST16_WIDTH__ 16
#define __SCHAR_MAX__ 0x7f
#define __FLT128_MANT_DIG__ 113
#define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1)
#define __INT64_C(c) c ## L
#define __GCC_ATOMIC_POINTER_LOCK_FREE 2
#define __FLT32X_MANT_DIG__ 53
#define __USER_LABEL_PREFIX__
#define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x
#define __STDC_HOSTED__ 1
#define __DEC64_MIN_EXP__ (-382)
#define __DBL_DIG__ 15
#define __FLT32_DIG__ 6
#define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F
#define __SHRT_WIDTH__ 16
#define __FLT32_IS_IEC_60559__ 2
#define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L
#define __STDC_UTF_16__ 1
#define __DBL_IS_IEC_60559__ 2
#define __DEC32_MAX__ 9.999999E96DF
#define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x
#define __FLT32X_HAS_INFINITY__ 1
#define __INT32_MAX__ 0x7fffffff
#define __unix__ 1
#define __INT_WIDTH__ 32
#define __SIZEOF_LONG__ 8
#define __STDC_IEC_559__ 1
#define __STDC_ISO_10646__ 201103L
#define __UINT16_C(c) c
#define __DECIMAL_DIG__ 21
#define __STDC_IEC_559_COMPLEX__ 1
#define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64
#define __gnu_linux__ 1
#define __FLT128_IS_IEC_60559__ 2
#define __FLT64X_MIN_10_EXP__ (-4931)
#define __LDBL_HAS_QUIET_NAN__ 1
#define __FLT64_MANT_DIG__ 53
#define __FLT64X_MANT_DIG__ 64
#define __GNUC__ 11
#define __pie__ 2
#define __MMX__ 1
#define __FLT_HAS_DENORM__ 1
#define __SIZEOF_LONG_DOUBLE__ 16
#define __BIGGEST_ALIGNMENT__ 16
#define __FLT64_MAX_10_EXP__ 308
#define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L)
#define __INT_FAST32_MAX__ 0x7fffffffffffffffL
#define __DBL_HAS_INFINITY__ 1
#define __SIZEOF_FLOAT__ 4
#define __HAVE_SPECULATION_SAFE_VALUE 1
#define __DEC32_MIN_EXP__ (-94)
#define __INTPTR_WIDTH__ 64
#define __FLT64X_HAS_INFINITY__ 1
#define __UINT_LEAST32_MAX__ 0xffffffffU
#define __FLT32X_HAS_DENORM__ 1
#define __INT_FAST16_TYPE__ long int
#define __MMX_WITH_SSE__ 1
#define __LDBL_HAS_DENORM__ 1
#define __FLT128_HAS_INFINITY__ 1
#define __DEC32_MIN__ 1E-95DF
#define __DBL_MAX_EXP__ 1024
#define __WCHAR_WIDTH__ 32
#define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32
#define __DEC128_EPSILON__ 1E-33DL
#define __SSE2_MATH__ 1
#define __ATOMIC_HLE_RELEASE 131072
#define __PTRDIFF_MAX__ 0x7fffffffffffffffL
#define __amd64 1
#define __STDC_NO_THREADS__ 1
#define __ATOMIC_HLE_ACQUIRE 65536
#define __LONG_LONG_MAX__ 0x7fffffffffffffffLL
#define __SIZEOF_SIZE_T__ 8
#define __FLT64X_MIN_EXP__ (-16381)
#define __SIZEOF_WINT_T__ 4
#define __LONG_LONG_WIDTH__ 64
#define __FLT32_MAX_EXP__ 128
#define __GXX_ABI_VERSION 1016
#define __FLT_MIN_EXP__ (-125)
#define __GCC_HAVE_DWARF2_CFI_ASM 1
#define __INT16_MAX__ 0x7fff
#define __x86_64 1
#define __INT_FAST64_TYPE__ long int
#define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64
#define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L)
#define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128
#define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x
#define __SIZEOF_POINTER__ 8
#define __LP64__ 1
#define __DBL_HAS_QUIET_NAN__ 1
#define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x
#define __DECIMAL_BID_FORMAT__ 1
#define __FLT64_MIN_EXP__ (-1021)
#define __FLT64_MIN_10_EXP__ (-307)
#define __FLT64X_DECIMAL_DIG__ 21
#define __DEC128_MIN__ 1E-6143DL
#define __REGISTER_PREFIX__
#define __UINT16_MAX__ 0xffff
#define __DBL_HAS_DENORM__ 1
#define __LDBL_HAS_INFINITY__ 1
#define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32
#define __UINT8_TYPE__ unsigned char
#define __FLT_DIG__ 6
#define __NO_INLINE__ 1
#define __DEC_EVAL_METHOD__ 2
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL
#define __FLT_MANT_DIG__ 24
#define __LDBL_DECIMAL_DIG__ 21
#define __VERSION__ "11.4.0"
#define __UINT64_C(c) c ## UL
#define _STDC_PREDEF_H 1
#define __INT_LEAST32_MAX__ 0x7fffffff
#define __GCC_ATOMIC_INT_LOCK_FREE 2
#define __FLT128_MAX_EXP__ 16384
#define __FLT32_MANT_DIG__ 24
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __FLT128_HAS_DENORM__ 1
#define __FLT32_DECIMAL_DIG__ 9
#define __FLT128_DIG__ 33
#define __INT32_C(c) c
#define __DEC64_EPSILON__ 1E-15DD
#define __ORDER_PDP_ENDIAN__ 3412
#define __DEC128_MIN_EXP__ (-6142)
#define __INT_FAST32_TYPE__ long int
#define __UINT_LEAST16_TYPE__ short unsigned int
#define unix 1
#define __SIZE_TYPE__ long unsigned int
#define __UINT64_MAX__ 0xffffffffffffffffUL
#define __FLT_IS_IEC_60559__ 2
#define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE"
#define __FLT64X_DIG__ 18
#define __INT8_TYPE__ signed char
#define __ELF__ 1
#define __GCC_ASM_FLAG_OUTPUTS__ 1
#define __UINT32_TYPE__ unsigned int
#define __FLT_RADIX__ 2
#define __INT_LEAST16_TYPE__ short int
#define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L
#define __UINTMAX_C(c) c ## UL
#define __SSE_MATH__ 1
#define __k8 1
#define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x
#define __SIG_ATOMIC_MAX__ 0x7fffffff
#define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2
#define __SIZEOF_PTRDIFF_T__ 8
#define __LDBL_DIG__ 18
#define __FLT64_IS_IEC_60559__ 2
#define __x86_64__ 1
#define __FLT32X_MIN_EXP__ (-1021)
#define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF
#define __INT_FAST16_MAX__ 0x7fffffffffffffffL
#define __FLT64_DIG__ 15
#define __UINT_FAST32_MAX__ 0xffffffffffffffffUL
#define __UINT_LEAST64_TYPE__ long unsigned int
#define __FLT_HAS_QUIET_NAN__ 1
#define __FLT_MAX_10_EXP__ 38
#define __LONG_MAX__ 0x7fffffffffffffffL
#define __FLT64X_HAS_DENORM__ 1
#define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL
#define __FLT_HAS_INFINITY__ 1
#define __GNUC_EXECUTION_CHARSET_NAME "UTF-8"
#define __UINT_FAST16_TYPE__ long unsigned int
#define __DEC64_MAX__ 9.999999999999999E384DD
#define __INT_FAST32_WIDTH__ 64
#define __CHAR16_TYPE__ short unsigned int
#define __PRAGMA_REDEFINE_EXTNAME 1
#define __SIZE_WIDTH__ 64
#define __SEG_FS 1
#define __INT_LEAST16_MAX__ 0x7fff
#define __DEC64_MANT_DIG__ 16
#define __INT64_MAX__ 0x7fffffffffffffffL
#define __SEG_GS 1
#define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32
#define __SIG_ATOMIC_WIDTH__ 32
#define __INT_LEAST64_TYPE__ long int
#define __INT16_TYPE__ short int
#define __INT_LEAST8_TYPE__ signed char
#define __STDC_VERSION__ 201710L
#define __SIZEOF_INT__ 4
#define __DEC32_MAX_EXP__ 97
#define __INT_FAST8_MAX__ 0x7f
#define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128
#define __INTPTR_MAX__ 0x7fffffffffffffffL
#define linux 1
#define __FLT64_HAS_QUIET_NAN__ 1
#define __FLT32_MIN_10_EXP__ (-37)
#define __FLT32X_DIG__ 15
#define __PTRDIFF_WIDTH__ 64
#define __LDBL_MANT_DIG__ 64
#define __FLT64_HAS_INFINITY__ 1
#define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x
#define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1)
#define __code_model_small__ 1
#define __GCC_ATOMIC_LONG_LOCK_FREE 2
#define __DEC32_MANT_DIG__ 7
#define __k8__ 1
#define __INTPTR_TYPE__ long int
#define __UINT16_TYPE__ short unsigned int
#define __WCHAR_TYPE__ int
#define __pic__ 2
#define __UINTPTR_MAX__ 0xffffffffffffffffUL
#define __INT_FAST64_WIDTH__ 64
#define __INT_FAST64_MAX__ 0x7fffffffffffffffL
#define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1
#define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F
#define __FLT32_HAS_INFINITY__ 1
#define __FLT64X_MAX_EXP__ 16384
#define __UINT_FAST64_TYPE__ long unsigned int
#define __INT_MAX__ 0x7fffffff
#define __linux__ 1
#define __INT64_TYPE__ long int
#define __FLT_MAX_EXP__ 128
#define __ORDER_BIG_ENDIAN__ 4321
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:50.9991067Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:50.9991360Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:50.9991703Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:50.9992012Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:50.9992281Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:50.9992588Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:50.9992901Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:50.9993250Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:50.9993614Z #define __SSE__ 1 2025-05-07T20:25:50.9993851Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:50.9994202Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:50.9994548Z #define __amd64__ 1 2025-05-07T20:25:50.9994784Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:50.9995046Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:50.9995315Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:50.9995591Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:50.9995863Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:50.9996138Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:50.9996403Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:50.9996685Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:50.9996955Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:50.9997310Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:50.9997794Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:50.9998161Z #define _LP64 1 2025-05-07T20:25:50.9998382Z #define __UINT8_C(c) c 2025-05-07T20:25:50.9998632Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:50.9998908Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:50.9999180Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:50.9999466Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:50.9999781Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:51.0000141Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:51.0000624Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:51.0001013Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.0001314Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.0001637Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:51.0002020Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:51.0002411Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:51.0002679Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:51.0003028Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:51.0003410Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:51.0003675Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:51.0003940Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:51.0004203Z #define __FXSR__ 1 2025-05-07T20:25:51.0004505Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:51.0004969Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:51.0014926Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:51.0015291Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:51.0015549Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:51.0015883Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:51.0016253Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:51.0016630Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:51.0016867Z #define __PIC__ 2 2025-05-07T20:25:51.0017116Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:51.0017519Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:51.0017910Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:51.0018239Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:51.0018649Z #define __SSE2__ 1 2025-05-07T20:25:51.0018866Z #define __INT32_TYPE__ int 2025-05-07T20:25:51.0019107Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:51.0019360Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:51.0019692Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:51.0020047Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:51.0020312Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:51.0020575Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:51.0020833Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.0021111Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:51.0021361Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:51.0021613Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:51.0021903Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.0022212Z #define __PIE__ 2 2025-05-07T20:25:51.0022543Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:51.0022960Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:51.0023318Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:51.0023697Z #define __INT16_C(c) c 2025-05-07T20:25:51.0023919Z #define __STDC__ 1 2025-05-07T20:25:51.0024156Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:51.0024438Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:51.0024696Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.0025003Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:51.0025363Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:51.0026022Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:51.0026302Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.0026591Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:51.0026864Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:51.0027149Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:51.0027447Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.0027726Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:51.0028029Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.0028433Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:51.0028820Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:51.0029128Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:51.0029439Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:51.0029700Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:51.0029862Z 2025-05-07T20:25:51.0580865Z 2025-05-07T20:25:51.0581247Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
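[NOTE] The macro dumps in this step come from GCC's preprocessor: -dM emits a #define line for every macro defined after preprocessing, -E stops after the preprocessing stage, and -x c++ - reads an (empty) translation unit from stdin. The same trick is reused later in this step to read __STDC_VERSION__ (201710L, i.e. C17) and __cplusplus (201703L, i.e. C++17). A minimal sketch of the check, assuming only that a GCC-compatible cc/c++ is on PATH:

    # Dump every predefined macro of the C++ front end; an empty program on stdin suffices.
    c++ -dM -E -x c++ - < /dev/null
    # Same idea for the C front end:
    cc -dM -E - < /dev/null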
2025-05-07T20:25:51.0581922Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:51.0582281Z 2025-05-07T20:25:52.9725962Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:52.9726465Z #define __cpp_attributes 200809L 2025-05-07T20:25:52.9726975Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:52.9727511Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:52.9727932Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:52.9728345Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:52.9728751Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:52.9729117Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:52.9729413Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:52.9729736Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:52.9730051Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:52.9730332Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:52.9730605Z #define __CHAR_BIT__ 8 2025-05-07T20:25:52.9730847Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:52.9731112Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:52.9733292Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:52.9733584Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:52.9733884Z #define __cpp_static_assert 201411L 2025-05-07T20:25:52.9734195Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:52.9734506Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9734981Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:52.9735462Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:52.9735797Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:52.9736139Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:52.9736562Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:52.9736989Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:52.9737309Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:52.9737604Z #define __GCC_IEC_559 2 2025-05-07T20:25:52.9737863Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:52.9738143Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:52.9738438Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:52.9738752Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:52.9739053Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:52.9739392Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:52.9739727Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:52.9740073Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9740423Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:52.9740710Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:52.9741002Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:52.9741284Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:52.9741602Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:52.9741886Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:52.9742179Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:52.9742491Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:52.9742920Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:52.9743268Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:52.9743529Z #define __INT8_C(c) c 2025-05-07T20:25:52.9743775Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:52.9744053Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:52.9744397Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9744739Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:52.9745030Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:52.9745334Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:52.9745666Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:52.9746035Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:52.9746332Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:52.9746632Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:52.9746911Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9747198Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:52.9747494Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:52.9747912Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:52.9748335Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:52.9748643Z #define __linux 1 2025-05-07T20:25:52.9748884Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:52.9749183Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:52.9749467Z #define __unix 1 2025-05-07T20:25:52.9749709Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:52.9750014Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:52.9750312Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:52.9750600Z #define __WINT_MIN__ 0U 2025-05-07T20:25:52.9750856Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:52.9751144Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:52.9751435Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:52.9751713Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:52.9751969Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:52.9752265Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:52.9752582Z #define __INT64_C(c) c ## L 2025-05-07T20:25:52.9752948Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:52.9753265Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:52.9753555Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:52.9753871Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:52.9754159Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:52.9754439Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:52.9754880Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:52.9755269Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:52.9755534Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:52.9755825Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:52.9756110Z #define __DBL_DIG__ 15 2025-05-07T20:25:52.9756347Z #define __FLT32_DIG__ 6 2025-05-07T20:25:52.9756658Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:52.9757010Z #define __GXX_WEAK__ 1 2025-05-07T20:25:52.9757254Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:52.9757523Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:52.9757857Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:52.9758223Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:52.9758499Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:52.9758813Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:52.9759158Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:52.9759590Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:52.9760013Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:52.9760299Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:52.9760570Z #define __unix__ 1 2025-05-07T20:25:52.9760805Z #define __INT_WIDTH__ 32 2025-05-07T20:25:52.9761054Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:52.9761312Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:52.9761578Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:52.9761851Z #define __UINT16_C(c) c 2025-05-07T20:25:52.9762097Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:52.9762374Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:52.9762749Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:52.9763123Z #define __gnu_linux__ 1 2025-05-07T20:25:52.9763373Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:52.9763651Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:52.9763940Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:52.9764253Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9764540Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:52.9764806Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:52.9765073Z #define __GNUC__ 11 2025-05-07T20:25:52.9765302Z #define __GXX_RTTI 1 2025-05-07T20:25:52.9765530Z #define __pie__ 2 2025-05-07T20:25:52.9765752Z #define __MMX__ 1 2025-05-07T20:25:52.9765984Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:52.9766254Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:52.9766562Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:52.9766838Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:52.9767095Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:52.9767424Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:52.9767750Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:52.9768106Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:52.9768491Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:52.9768798Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9769130Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:52.9769401Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:52.9769670Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:52.9769987Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:52.9770290Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:52.9770556Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:52.9770821Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:52.9771115Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:52.9771418Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:52.9771790Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:52.9772076Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:52.9772333Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:52.9772620Z #define __cplusplus 201703L 2025-05-07T20:25:52.9772905Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:52.9773192Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:52.9773442Z #define __DEPRECATED 1 2025-05-07T20:25:52.9773775Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:52.9774073Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:52.9774327Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:52.9774705Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:52.9775066Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:52.9775328Z #define __SSE2_MATH__ 1 2025-05-07T20:25:52.9775576Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:52.9775876Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9776168Z #define __amd64 1 2025-05-07T20:25:52.9776389Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:52.9776662Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:52.9776930Z #define __GNUG__ 11 2025-05-07T20:25:52.9777176Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:52.9777491Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:52.9777744Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:52.9777996Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:52.9778278Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:52.9778544Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:52.9778815Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:52.9779112Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:52.9779376Z #define __cpp_hex_float 201603L 2025-05-07T20:25:52.9779635Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:52.9779903Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:52.9780179Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:52.9780439Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:52.9780710Z #define __x86_64 1 2025-05-07T20:25:52.9780940Z #define __cpp_lambdas 200907L 2025-05-07T20:25:52.9781216Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:52.9781581Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:52.9781975Z #define __cpp_template_auto 201606L 2025-05-07T20:25:52.9782335Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:52.9782840Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:52.9783318Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:52.9783713Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:52.9783962Z #define __LP64__ 1 2025-05-07T20:25:52.9784191Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9784539Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:52.9784925Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:52.9785196Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:52.9785483Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:52.9785768Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:52.9786032Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:52.9786293Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:52.9786562Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:52.9786886Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:52.9787248Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:52.9787528Z #define __FLT_DIG__ 6 2025-05-07T20:25:52.9787751Z #define __NO_INLINE__ 1 2025-05-07T20:25:52.9787991Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:52.9788314Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:52.9788661Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:52.9788915Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:52.9789184Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:52.9789442Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:52.9789708Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:52.9790006Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:52.9790393Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:52.9790684Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:52.9790973Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:52.9791243Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:52.9791537Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:52.9791882Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:52.9792287Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:52.9792550Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:52.9792813Z #define __FLT128_DIG__ 33 2025-05-07T20:25:52.9793060Z #define __INT32_C(c) c 2025-05-07T20:25:52.9793296Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:52.9793577Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:52.9793863Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:52.9794146Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:52.9794462Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:52.9794780Z #define unix 1 2025-05-07T20:25:52.9794999Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:52.9795257Z #define __cpp_rtti 199711L 2025-05-07T20:25:52.9795522Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:52.9795839Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9796142Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:52.9796453Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:52.9796785Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:52.9797037Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:52.9797327Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:52.9797608Z #define __ELF__ 1 2025-05-07T20:25:52.9797832Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:52.9798119Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:52.9798400Z #define __FLT_RADIX__ 2 2025-05-07T20:25:52.9798646Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:52.9799000Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:52.9799371Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:52.9799651Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:52.9799918Z #define __k8 1 2025-05-07T20:25:52.9800211Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:52.9800599Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:52.9800890Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:52.9801193Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:52.9801456Z #define __LDBL_DIG__ 18 2025-05-07T20:25:52.9801689Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:52.9801947Z #define __x86_64__ 1 2025-05-07T20:25:52.9802184Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:52.9802479Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:52.9802820Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9803129Z #define __FLT64_DIG__ 15 2025-05-07T20:25:52.9803407Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9803751Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:52.9804071Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9804340Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:52.9804612Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9804913Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:52.9805276Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:52.9805669Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:52.9805961Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:52.9806285Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:52.9806598Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:52.9806925Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:52.9807224Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:52.9807503Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:52.9807808Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:52.9808088Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:52.9808326Z #define __SEG_FS 1 2025-05-07T20:25:52.9808550Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:52.9808918Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:52.9809204Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9809488Z #define __SEG_GS 1 2025-05-07T20:25:52.9809799Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:52.9810185Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:52.9810454Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:52.9810824Z #define __INT16_TYPE__ short int 2025-05-07T20:25:52.9811110Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:52.9811425Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:52.9811716Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:52.9811964Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:52.9812228Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:52.9812576Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:52.9813006Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9813325Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:52.9813651Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:52.9813953Z #define linux 1 2025-05-07T20:25:52.9814177Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9814448Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:52.9814779Z #define __EXCEPTIONS 1 2025-05-07T20:25:52.9815026Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:52.9815284Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:52.9815561Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:52.9815855Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:52.9816207Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:52.9816594Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:52.9816941Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:52.9817278Z #define __code_model_small__ 1 2025-05-07T20:25:52.9817545Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:52.9817853Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:52.9818162Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:52.9818431Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:52.9818728Z #define __k8__ 1 2025-05-07T20:25:52.9818954Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:52.9819238Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:52.9819541Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:52.9819786Z #define __pic__ 2 2025-05-07T20:25:52.9820037Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9820344Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:52.9820611Z #define __cpp_decltype 200707L 2025-05-07T20:25:52.9820904Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9821229Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:52.9821603Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:52.9821966Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:52.9822261Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:52.9822619Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:52.9822945Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:52.9823191Z #define __linux__ 1 2025-05-07T20:25:52.9823417Z #define __INT64_TYPE__ long int 2025-05-07T20:25:52.9823681Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:52.9832919Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:52.9833262Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:52.9833567Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:52.9833922Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:52.9834225Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9834555Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:52.9834830Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:52.9835126Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:52.9835434Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:52.9835781Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:52.9836147Z #define __SSE__ 1 2025-05-07T20:25:52.9836386Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:52.9836923Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:52.9837280Z #define __amd64__ 1 2025-05-07T20:25:52.9837513Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:52.9837782Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:52.9838067Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:52.9838333Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:52.9838750Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:52.9839018Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:52.9839300Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:52.9839578Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:52.9839935Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:52.9840412Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:52.9840784Z #define _LP64 1 2025-05-07T20:25:52.9841010Z #define __UINT8_C(c) c 2025-05-07T20:25:52.9841254Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:52.9841536Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:52.9841819Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:52.9842083Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:52.9842452Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:52.9842936Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:52.9843324Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9843632Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9843957Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:52.9844283Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:52.9844669Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:52.9845061Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:52.9845339Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:52.9845606Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:52.9845965Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:52.9846359Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:52.9846633Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:52.9846884Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:52.9847140Z #define __FXSR__ 1 2025-05-07T20:25:52.9847455Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:52.9847917Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:52.9848350Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:52.9848670Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:52.9848940Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:52.9849254Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:52.9849565Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:52.9849839Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:52.9850209Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:52.9850581Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:52.9850852Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:52.9851106Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:52.9851347Z #define __PIC__ 2 2025-05-07T20:25:52.9851598Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:52.9851995Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:52.9852387Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:52.9852726Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:52.9853078Z #define __cpp_constexpr 201603L 2025-05-07T20:25:52.9853345Z #define __SSE2__ 1 2025-05-07T20:25:52.9853588Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:52.9853878Z #define __INT32_TYPE__ int 2025-05-07T20:25:52.9854134Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:52.9854402Z #define __cpp_exceptions 199711L 2025-05-07T20:25:52.9854760Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:52.9855102Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:52.9855467Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:52.9855836Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:52.9857564Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:52.9857837Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9858116Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:52.9858359Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:52.9858613Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:52.9858908Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:52.9859277Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9859574Z #define __PIE__ 2 2025-05-07T20:25:52.9859895Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:52.9860317Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:52.9860636Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:52.9860985Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:52.9861354Z #define __INT16_C(c) c 2025-05-07T20:25:52.9861574Z #define __STDC__ 1 2025-05-07T20:25:52.9861795Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:52.9862055Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:52.9862324Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:52.9862583Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:52.9862882Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:52.9863225Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:52.9863567Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:52.9863838Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:52.9864122Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:52.9864406Z #define __SSE_MATH__ 1 2025-05-07T20:25:52.9864655Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:52.9864934Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:52.9865241Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:52.9865528Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:52.9865819Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9866089Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:52.9866393Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9866792Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:52.9867173Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:52.9867480Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:52.9867775Z #define _GNU_SOURCE 1 2025-05-07T20:25:52.9868015Z #define __cpp_init_captures 201304L 2025-05-07T20:25:52.9868301Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:52.9868550Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:52.9868710Z 2025-05-07T20:25:53.0402414Z 2025-05-07T20:25:53.0403057Z + conda run -n build_binary c++ --version 2025-05-07T20:25:53.0403654Z 2025-05-07T20:25:54.9421928Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:54.9422483Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:54.9422962Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:54.9423529Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:54.9423903Z 2025-05-07T20:25:54.9423909Z 2025-05-07T20:25:55.0149351Z 2025-05-07T20:25:55.0150197Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:55.0150870Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:55.0151187Z 2025-05-07T20:25:56.9939562Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:56.9941903Z 2025-05-07T20:25:56.9942457Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:56.9943271Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:56.9943730Z 2025-05-07T20:25:58.9855703Z #define __cplusplus 201703L 2025-05-07T20:25:58.9858137Z 2025-05-07T20:25:58.9859290Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:58.9896569Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:58.9897025Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:58.9909286Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:58.9909660Z env: 2025-05-07T20:25:58.9909907Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:58.9910221Z BUILD_ENV: build_binary 2025-05-07T20:25:58.9910481Z BUILD_TARGET: genai 2025-05-07T20:25:58.9910740Z BUILD_VARIANT: cuda 2025-05-07T20:25:58.9910989Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:58.9911462Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:58.9911784Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:58.9912143Z ##[endgroup] 2025-05-07T20:25:59.3334929Z ################################################################################ 2025-05-07T20:25:59.3335313Z # Install CUDA 2025-05-07T20:25:59.3335527Z # 2025-05-07T20:25:59.3351763Z # [2025-05-07T20:25:59.334Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:59.3352352Z ################################################################################ 2025-05-07T20:25:59.3352689Z 2025-05-07T20:25:59.3368973Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:59.4230782Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:59.4231348Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:59.4236414Z + conda clean --packages --tarball -y 2025-05-07T20:25:59.4236653Z 2025-05-07T20:26:00.3039237Z Will remove 40 (182.7 MB) tarball(s). 2025-05-07T20:26:00.3039625Z Will remove 7 (108.6 MB) package(s). 2025-05-07T20:26:00.3757723Z 2025-05-07T20:26:00.3769207Z + conda clean --all -y 2025-05-07T20:26:00.3769382Z 2025-05-07T20:26:01.0520780Z There are no unused tarball(s) to remove. 2025-05-07T20:26:01.0521655Z Will remove 1 index cache(s). 2025-05-07T20:26:01.0522347Z There are no unused package(s) to remove. 2025-05-07T20:26:01.0523059Z There are no tempfile(s) to remove. 2025-05-07T20:26:01.0523647Z There are no logfile(s) to remove. 2025-05-07T20:26:01.1190911Z 2025-05-07T20:26:01.1205868Z [INSTALL] Installing CUDA 12.6.3 ... 
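[NOTE] The [EXEC] [ATTEMPT n/3] markers in this job come from a retry helper in setup_env.bash. A minimal sketch of that pattern, assuming a fixed three-attempt loop (the function name exec_with_retries, the sleep between attempts, and the failure message are illustrative, not the script's actual code):

    # Run a command, retrying up to 3 times before giving up.
    exec_with_retries () {
      local max_attempts=3 i
      for ((i = 0; i < max_attempts; i++)); do
        echo "[EXEC] [ATTEMPT ${i}/${max_attempts}] + $*"
        "$@" && return 0
        sleep 2  # assumed pause between attempts
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    }

    # Usage mirroring the install below; --force-reinstall plus --override-channels keeps
    # the conda-forge cuda=12.6.3 metapackage authoritative for the build_binary env.
    exec_with_retries conda install --force-reinstall -n build_binary \
      -c conda-forge --override-channels -y cuda=12.6.3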
2025-05-07T20:26:01.1231850Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:26:02.0325968Z Channels: 2025-05-07T20:26:02.0326222Z - conda-forge 2025-05-07T20:26:02.0326475Z Platform: linux-64 2025-05-07T20:26:12.9108405Z Collecting package metadata (repodata.json): done 2025-05-07T20:26:14.0257885Z Solving environment: done 2025-05-07T20:26:14.1017418Z 2025-05-07T20:26:14.1017721Z ## Package Plan ## 2025-05-07T20:26:14.1017886Z 2025-05-07T20:26:14.1018233Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:14.1018673Z 2025-05-07T20:26:14.1018827Z added / updated specs: 2025-05-07T20:26:14.1019131Z - cuda=12.6.3 2025-05-07T20:26:14.1019265Z 2025-05-07T20:26:14.1019300Z 2025-05-07T20:26:14.1019420Z The following packages will be downloaded: 2025-05-07T20:26:14.1019640Z 2025-05-07T20:26:14.1019752Z package | build 2025-05-07T20:26:14.1020078Z ---------------------------|----------------- 2025-05-07T20:26:14.1020454Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:26:14.1020864Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:26:14.1021436Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:26:14.1022104Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:26:14.1022677Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:26:14.1023200Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:26:14.1024420Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:26:14.1024950Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:26:14.1025742Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:26:14.1026239Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:26:14.1026712Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:14.1027191Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:14.1027699Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:26:14.1028410Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:14.1028953Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:26:14.1029482Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:26:14.1029983Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:26:14.1030453Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:26:14.1030922Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:26:14.1031395Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:26:14.1031883Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:14.1032399Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:26:14.1032888Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:26:14.1033349Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:14.1033846Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:14.1034335Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:26:14.1034784Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:26:14.1035260Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:26:14.1035752Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 2025-05-07T20:26:14.1036227Z cuda-nvcc-tools-12.6.85 | he02047a_0 
23.0 MB conda-forge 2025-05-07T20:26:14.1036718Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:14.1037205Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:26:14.1037679Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:26:14.1038138Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:26:14.1038603Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:26:14.1039076Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:26:14.1039546Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:26:14.1039999Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:14.1040476Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:26:14.1040971Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:26:14.1041445Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:26:14.1041917Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:26:14.1042372Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:14.1042850Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:26:14.1043509Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:14.1043998Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:26:14.1044489Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:26:14.1044969Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:26:14.1045424Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:14.1045877Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:26:14.1046356Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:14.1046913Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:14.1047344Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:14.1047874Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:14.1048412Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:14.1048952Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:14.1049507Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:14.1049972Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:14.1050454Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:14.1050949Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:14.1051406Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:14.1051826Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:14.1052243Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:26:14.1052665Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:14.1053046Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:14.1053451Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:14.1053857Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:14.1054251Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:14.1054807Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:26:14.1055264Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:26:14.1055725Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:26:14.1056170Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:14.1056622Z libcufile-1.11.1.6 | h12f29b5_4 900 KB conda-forge 2025-05-07T20:26:14.1057082Z libcufile-dev-1.11.1.6 | h5888daf_4 35 
KB conda-forge 2025-05-07T20:26:14.1057539Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:26:14.1058050Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:26:14.1058514Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:26:14.1058987Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:26:14.1059456Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:26:14.1059937Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:14.1060416Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:26:14.1060870Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:14.1061330Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:14.1061900Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:14.1062350Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:14.1062783Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:14.1063228Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:14.1063650Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:14.1064068Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:26:14.1064585Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:26:14.1065024Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:14.1065464Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:26:14.1065937Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:14.1066411Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:26:14.1066888Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:14.1067358Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:26:14.1067865Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:14.1068308Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:14.1068740Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:14.1069196Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:14.1069630Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:14.1070053Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:14.1070496Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:14.1079687Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:14.1080176Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:14.1080620Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:14.1081039Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:14.1081508Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:26:14.1081978Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:14.1082388Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:14.1082803Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:14.1083269Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:14.1083731Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:26:14.1084179Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:14.1084623Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:14.1085037Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:26:14.1085451Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:14.1085864Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:14.1086283Z xcb-util-0.4.1 | hb711507_2 19 
KB conda-forge 2025-05-07T20:26:14.1086739Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:14.1087220Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:14.1087850Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:14.1088364Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:14.1088844Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:14.1089314Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:14.1089781Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:14.1090227Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:14.1090785Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:14.1091227Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:14.1091709Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:14.1092210Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:14.1092688Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:14.1093140Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:14.1093603Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:14.1094056Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:14.1094501Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:14.1095089Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:14.1095569Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:14.1095994Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:14.1096377Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:14.1096770Z ------------------------------------------------------------ 2025-05-07T20:26:14.1097121Z Total: 1.61 GB 2025-05-07T20:26:14.1097339Z 2025-05-07T20:26:14.1097465Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:14.1097722Z 2025-05-07T20:26:14.1097960Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:14.1098387Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:14.1098812Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:14.1099276Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:14.1099712Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:26:14.1100191Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:26:14.1100802Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:26:14.1101391Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:26:14.1101948Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:26:14.1102522Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:26:14.1103227Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 2025-05-07T20:26:14.1103757Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:26:14.1104355Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:14.1104985Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:26:14.1105624Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:14.1106240Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 
2025-05-07T20:26:14.1106914Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1107455Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0
2025-05-07T20:26:14.1108035Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0
2025-05-07T20:26:14.1108578Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1109129Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0
2025-05-07T20:26:14.1109798Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0
2025-05-07T20:26:14.1110672Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1
2025-05-07T20:26:14.1111353Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0
2025-05-07T20:26:14.1112009Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0
2025-05-07T20:26:14.1112586Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0
2025-05-07T20:26:14.1113091Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0
2025-05-07T20:26:14.1113623Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0
2025-05-07T20:26:14.1114375Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0
2025-05-07T20:26:14.1115106Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0
2025-05-07T20:26:14.1115675Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0
2025-05-07T20:26:14.1116240Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1116786Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1117310Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0
2025-05-07T20:26:14.1117829Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1118343Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0
2025-05-07T20:26:14.1119022Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0
2025-05-07T20:26:14.1119712Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0
2025-05-07T20:26:14.1120240Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0
2025-05-07T20:26:14.1120822Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0
2025-05-07T20:26:14.1121397Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0
2025-05-07T20:26:14.1122122Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1
2025-05-07T20:26:14.1122730Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0
2025-05-07T20:26:14.1123271Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0
2025-05-07T20:26:14.1123867Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0
2025-05-07T20:26:14.1124430Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0
2025-05-07T20:26:14.1124992Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1125764Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0
2025-05-07T20:26:14.1126259Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0
2025-05-07T20:26:14.1126747Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3
2025-05-07T20:26:14.1127286Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0
2025-05-07T20:26:14.1127852Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0
2025-05-07T20:26:14.1128315Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3
2025-05-07T20:26:14.1128839Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
2025-05-07T20:26:14.1129471Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
2025-05-07T20:26:14.1130266Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
2025-05-07T20:26:14.1130861Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
2025-05-07T20:26:14.1131370Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
2025-05-07T20:26:14.1131878Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
2025-05-07T20:26:14.1132382Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0
2025-05-07T20:26:14.1132855Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1
2025-05-07T20:26:14.1133440Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:26:14.1134358Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4
2025-05-07T20:26:14.1135034Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2
2025-05-07T20:26:14.1135423Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13
2025-05-07T20:26:14.1135841Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
2025-05-07T20:26:14.1136265Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0
2025-05-07T20:26:14.1136676Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0
2025-05-07T20:26:14.1137130Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1
2025-05-07T20:26:14.1137668Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1
2025-05-07T20:26:14.1138200Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0
2025-05-07T20:26:14.1138694Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0
2025-05-07T20:26:14.1139196Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4
2025-05-07T20:26:14.1139706Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4
2025-05-07T20:26:14.1140223Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0
2025-05-07T20:26:14.1140747Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0
2025-05-07T20:26:14.1141274Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1
2025-05-07T20:26:14.1141823Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1
2025-05-07T20:26:14.1142375Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0
2025-05-07T20:26:14.1142920Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0
2025-05-07T20:26:14.1143443Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
2025-05-07T20:26:14.1143942Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
2025-05-07T20:26:14.1144457Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
2025-05-07T20:26:14.1144983Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
2025-05-07T20:26:14.1145470Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
2025-05-07T20:26:14.1146122Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
2025-05-07T20:26:14.1146783Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
2025-05-07T20:26:14.1147343Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
2025-05-07T20:26:14.1147775Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0
2025-05-07T20:26:14.1148250Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0
2025-05-07T20:26:14.1148733Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
2025-05-07T20:26:14.1149211Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0
2025-05-07T20:26:14.1149755Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0
2025-05-07T20:26:14.1150306Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0
2025-05-07T20:26:14.1150865Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0
2025-05-07T20:26:14.1151533Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0
2025-05-07T20:26:14.1152068Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0
2025-05-07T20:26:14.1152570Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0
2025-05-07T20:26:14.1153037Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0
2025-05-07T20:26:14.1153513Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
2025-05-07T20:26:14.1153961Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
2025-05-07T20:26:14.1154557Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
2025-05-07T20:26:14.1155067Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
2025-05-07T20:26:14.1155528Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
2025-05-07T20:26:14.1155963Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
2025-05-07T20:26:14.1156475Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0
2025-05-07T20:26:14.1156978Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0
2025-05-07T20:26:14.1157368Z nss conda-forge/linux-64::nss-3.111-h159eef7_0
2025-05-07T20:26:14.1157832Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
2025-05-07T20:26:14.1158347Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
2025-05-07T20:26:14.1158857Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2
2025-05-07T20:26:14.1159343Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
2025-05-07T20:26:14.1159863Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0
2025-05-07T20:26:14.1160320Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
2025-05-07T20:26:14.1160765Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
2025-05-07T20:26:14.1161292Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
2025-05-07T20:26:14.1161849Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
2025-05-07T20:26:14.1162412Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
2025-05-07T20:26:14.1163009Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
2025-05-07T20:26:14.1163573Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:26:14.1164109Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:26:14.1164729Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:26:14.1165395Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:26:14.1166004Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:26:14.1166502Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:26:14.1167074Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:26:14.1167682Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:26:14.1168265Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:26:14.1168790Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:26:14.1169311Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:26:14.1169825Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:26:14.1170440Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:26:14.1171204Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:26:14.1171780Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:26:14.1172403Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:26:14.1172755Z 
2025-05-07T20:26:14.1173073Z The following packages will be UPDATED:
2025-05-07T20:26:14.1173291Z 
2025-05-07T20:26:14.1173464Z libsqlite 3.46.0-hde9e2c9_0 --> 3.49.2-hee588c1_0
2025-05-07T20:26:14.1173881Z libzlib 1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:14.1174281Z zlib 1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:14.1174533Z 
2025-05-07T20:26:14.1174843Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:26:14.1175263Z 
2025-05-07T20:26:14.1175536Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:26:14.1176134Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:26:14.1176474Z 
2025-05-07T20:26:14.1176656Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:14.2121606Z ... (more hidden) ...
2025-05-07T20:26:19.6815075Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%
2025-05-07T20:26:19.6945032Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:26:21.9576305Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
| 443.1 MB | ######5 | 66% 2025-05-07T20:26:23.0274570Z 2025-05-07T20:26:23.0274576Z 2025-05-07T20:26:23.0274581Z 2025-05-07T20:26:23.0274586Z 2025-05-07T20:26:23.0277875Z 2025-05-07T20:26:23.0395115Z cuda-nvvp-12.6.80 | 109.3 MB | ########7 | 87%  2025-05-07T20:26:23.0395674Z 2025-05-07T20:26:23.0395678Z 2025-05-07T20:26:23.0395681Z 2025-05-07T20:26:23.0395685Z 2025-05-07T20:26:23.0395688Z 2025-05-07T20:26:23.0395692Z 2025-05-07T20:26:23.0397157Z 2025-05-07T20:26:23.0802865Z libnpp-12.3.1.54 | 93.4 MB | ##8 | 29%  2025-05-07T20:26:23.0803257Z 2025-05-07T20:26:23.0803263Z 2025-05-07T20:26:23.0803285Z 2025-05-07T20:26:23.0803291Z 2025-05-07T20:26:23.0803296Z 2025-05-07T20:26:23.0806711Z 2025-05-07T20:26:23.1083868Z libcusolver-11.7.1.2 | 95.8 MB | #########5 | 96%  2025-05-07T20:26:23.1431242Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:26:23.1431624Z 2025-05-07T20:26:23.1431630Z 2025-05-07T20:26:23.1431636Z 2025-05-07T20:26:23.1431642Z 2025-05-07T20:26:23.1431648Z 2025-05-07T20:26:23.1431655Z 2025-05-07T20:26:23.1434304Z 2025-05-07T20:26:23.1459928Z libnpp-12.3.1.54 | 93.4 MB | ###1 | 32%  2025-05-07T20:26:23.1460405Z 2025-05-07T20:26:23.1460410Z 2025-05-07T20:26:23.1460415Z 2025-05-07T20:26:23.1460420Z 2025-05-07T20:26:23.1460425Z 2025-05-07T20:26:23.1807459Z cuda-nvvp-12.6.80 | 109.3 MB | ########9 | 90%  2025-05-07T20:26:23.1807864Z 2025-05-07T20:26:23.1807869Z 2025-05-07T20:26:23.1807874Z 2025-05-07T20:26:23.1807879Z 2025-05-07T20:26:23.1807884Z 2025-05-07T20:26:23.1809995Z 2025-05-07T20:26:23.2121671Z libcusolver-11.7.1.2 | 95.8 MB | #########8 | 99%  2025-05-07T20:26:23.2433347Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:26:23.2433731Z 2025-05-07T20:26:23.2433737Z 2025-05-07T20:26:23.2433742Z 2025-05-07T20:26:23.2433747Z 2025-05-07T20:26:23.2433752Z 2025-05-07T20:26:23.2433758Z 2025-05-07T20:26:23.2437176Z 2025-05-07T20:26:23.2505226Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 35%  2025-05-07T20:26:23.2505524Z 2025-05-07T20:26:23.2505538Z 2025-05-07T20:26:23.2505556Z 2025-05-07T20:26:23.2505560Z 2025-05-07T20:26:23.2505563Z 2025-05-07T20:26:23.3129911Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 92%  2025-05-07T20:26:23.3434048Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:26:23.3434360Z 2025-05-07T20:26:23.3434365Z 2025-05-07T20:26:23.3434369Z 2025-05-07T20:26:23.3434372Z 2025-05-07T20:26:23.3434376Z 2025-05-07T20:26:23.3434379Z 2025-05-07T20:26:23.3434402Z 2025-05-07T20:26:23.3509056Z libnpp-12.3.1.54 | 93.4 MB | ###7 | 38%  2025-05-07T20:26:23.3509362Z 2025-05-07T20:26:23.3509368Z 2025-05-07T20:26:23.3509374Z 2025-05-07T20:26:23.3509379Z 2025-05-07T20:26:23.3511336Z 2025-05-07T20:26:23.4130485Z cuda-nvvp-12.6.80 | 109.3 MB | #########4 | 94%  2025-05-07T20:26:23.4437157Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:26:23.4437462Z 2025-05-07T20:26:23.4437466Z 2025-05-07T20:26:23.4437470Z 2025-05-07T20:26:23.4437473Z 2025-05-07T20:26:23.4437496Z 2025-05-07T20:26:23.4437500Z 2025-05-07T20:26:23.4440878Z 2025-05-07T20:26:23.4511830Z libnpp-12.3.1.54 | 93.4 MB | ####1 | 41%  2025-05-07T20:26:23.4512216Z 2025-05-07T20:26:23.4512221Z 2025-05-07T20:26:23.4512224Z 2025-05-07T20:26:23.4512228Z 2025-05-07T20:26:23.4512231Z 2025-05-07T20:26:23.5137595Z cuda-nvvp-12.6.80 | 109.3 MB | #########6 | 97%  2025-05-07T20:26:23.5401454Z nsight-compute-2024. 
| 443.1 MB | ######8 | 69% 2025-05-07T20:26:23.5401836Z 2025-05-07T20:26:23.5401843Z 2025-05-07T20:26:23.5401848Z 2025-05-07T20:26:23.5404918Z 2025-05-07T20:26:23.5516083Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:23.5516445Z 2025-05-07T20:26:23.5516451Z 2025-05-07T20:26:23.5516456Z 2025-05-07T20:26:23.5516462Z 2025-05-07T20:26:23.5520715Z 2025-05-07T20:26:23.5531702Z cuda-nvvp-12.6.80 | 109.3 MB | #########9 | 99%  2025-05-07T20:26:23.5532003Z 2025-05-07T20:26:23.5532284Z 2025-05-07T20:26:23.5532288Z 2025-05-07T20:26:23.5532291Z 2025-05-07T20:26:23.5532295Z 2025-05-07T20:26:23.5532298Z 2025-05-07T20:26:23.5532302Z 2025-05-07T20:26:23.6141404Z libnpp-12.3.1.54 | 93.4 MB | ####4 | 44%  2025-05-07T20:26:23.6538024Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:26:23.6538304Z 2025-05-07T20:26:23.6538308Z 2025-05-07T20:26:23.6538313Z 2025-05-07T20:26:23.6538333Z 2025-05-07T20:26:23.6538336Z 2025-05-07T20:26:23.6538340Z 2025-05-07T20:26:23.6540167Z 2025-05-07T20:26:23.7143164Z libnpp-12.3.1.54 | 93.4 MB | ####8 | 49%  2025-05-07T20:26:23.7542427Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:26:23.7542793Z 2025-05-07T20:26:23.7542800Z 2025-05-07T20:26:23.7542805Z 2025-05-07T20:26:23.7542810Z 2025-05-07T20:26:23.7542815Z 2025-05-07T20:26:23.7542829Z 2025-05-07T20:26:23.7545723Z 2025-05-07T20:26:23.8144097Z libnpp-12.3.1.54 | 93.4 MB | #####3 | 53%  2025-05-07T20:26:23.8543621Z nsight-compute-2024. | 443.1 MB | #######1 | 71% 2025-05-07T20:26:23.8543892Z 2025-05-07T20:26:23.8543896Z 2025-05-07T20:26:23.8543899Z 2025-05-07T20:26:23.8543903Z 2025-05-07T20:26:23.8543906Z 2025-05-07T20:26:23.8543910Z 2025-05-07T20:26:23.8547635Z 2025-05-07T20:26:23.9146500Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 58%  2025-05-07T20:26:23.9545958Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:26:23.9546230Z 2025-05-07T20:26:23.9546234Z 2025-05-07T20:26:23.9546237Z 2025-05-07T20:26:23.9546241Z 2025-05-07T20:26:23.9546244Z 2025-05-07T20:26:23.9546248Z 2025-05-07T20:26:23.9547875Z 2025-05-07T20:26:24.0147443Z libnpp-12.3.1.54 | 93.4 MB | ######1 | 62%  2025-05-07T20:26:24.0608692Z nsight-compute-2024. | 443.1 MB | #######2 | 73% 2025-05-07T20:26:24.0608983Z 2025-05-07T20:26:24.0608988Z 2025-05-07T20:26:24.0608993Z 2025-05-07T20:26:24.0609012Z 2025-05-07T20:26:24.0609017Z 2025-05-07T20:26:24.0609022Z 2025-05-07T20:26:24.0609214Z 2025-05-07T20:26:24.1156986Z libnpp-12.3.1.54 | 93.4 MB | ######5 | 65%  2025-05-07T20:26:24.1757660Z nsight-compute-2024. | 443.1 MB | #######3 | 74% 2025-05-07T20:26:24.1757985Z 2025-05-07T20:26:24.1757989Z 2025-05-07T20:26:24.1757993Z 2025-05-07T20:26:24.1757996Z 2025-05-07T20:26:24.1758000Z 2025-05-07T20:26:24.1758018Z 2025-05-07T20:26:24.1758022Z 2025-05-07T20:26:24.2168178Z libnpp-12.3.1.54 | 93.4 MB | ######9 | 69%  2025-05-07T20:26:24.3146634Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:26:24.3146931Z 2025-05-07T20:26:24.3146937Z 2025-05-07T20:26:24.3146942Z 2025-05-07T20:26:24.3146951Z 2025-05-07T20:26:24.3146956Z 2025-05-07T20:26:24.3146963Z 2025-05-07T20:26:24.3155224Z 2025-05-07T20:26:24.3172313Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 73%  2025-05-07T20:26:24.4149875Z nsight-compute-2024. 
| 443.1 MB | #######5 | 75% 2025-05-07T20:26:24.4150152Z 2025-05-07T20:26:24.4150157Z 2025-05-07T20:26:24.4150161Z 2025-05-07T20:26:24.4150165Z 2025-05-07T20:26:24.4150168Z 2025-05-07T20:26:24.4150172Z 2025-05-07T20:26:24.4152646Z 2025-05-07T20:26:24.4251484Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 77%  2025-05-07T20:26:24.5150409Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:26:24.5150705Z 2025-05-07T20:26:24.5150709Z 2025-05-07T20:26:24.5150713Z 2025-05-07T20:26:24.5150716Z 2025-05-07T20:26:24.5150720Z 2025-05-07T20:26:24.5150724Z 2025-05-07T20:26:24.5150727Z 2025-05-07T20:26:24.5281968Z libnpp-12.3.1.54 | 93.4 MB | ######## | 80%  2025-05-07T20:26:24.6191397Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:26:24.6191753Z 2025-05-07T20:26:24.6191759Z 2025-05-07T20:26:24.6191764Z 2025-05-07T20:26:24.6191770Z 2025-05-07T20:26:24.6191775Z 2025-05-07T20:26:24.6192092Z 2025-05-07T20:26:24.6192099Z 2025-05-07T20:26:24.6282051Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 84%  2025-05-07T20:26:24.7285256Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:26:24.7942852Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:26:24.7943246Z 2025-05-07T20:26:24.7943251Z 2025-05-07T20:26:24.7943256Z 2025-05-07T20:26:24.7943261Z 2025-05-07T20:26:24.7943288Z 2025-05-07T20:26:24.7943293Z 2025-05-07T20:26:24.7946698Z 2025-05-07T20:26:24.8284615Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 88%  2025-05-07T20:26:24.8953464Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:26:24.8953745Z 2025-05-07T20:26:24.8953749Z 2025-05-07T20:26:24.8953752Z 2025-05-07T20:26:24.8953756Z 2025-05-07T20:26:24.8953759Z 2025-05-07T20:26:24.8953764Z 2025-05-07T20:26:24.8953768Z 2025-05-07T20:26:24.9400786Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 91%  2025-05-07T20:26:24.9953983Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:26:24.9954255Z 2025-05-07T20:26:24.9954493Z 2025-05-07T20:26:24.9954502Z 2025-05-07T20:26:24.9954509Z 2025-05-07T20:26:24.9954515Z 2025-05-07T20:26:24.9954520Z 2025-05-07T20:26:24.9958650Z 2025-05-07T20:26:25.0404103Z libnpp-12.3.1.54 | 93.4 MB | #########4 | 95%  2025-05-07T20:26:25.1404762Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:26:25.1439352Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:26:25.1439615Z 2025-05-07T20:26:25.1439897Z 2025-05-07T20:26:25.1439901Z 2025-05-07T20:26:25.1439905Z 2025-05-07T20:26:25.1440072Z 2025-05-07T20:26:25.1440078Z 2025-05-07T20:26:25.1440496Z 2025-05-07T20:26:25.2417264Z libnpp-12.3.1.54 | 93.4 MB | #########8 | 98%  2025-05-07T20:26:25.3419966Z nsight-compute-2024. | 443.1 MB | ########2 | 83% 2025-05-07T20:26:25.4420692Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:26:25.5422574Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:26:25.6426162Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:26:25.7429833Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:26:25.8437424Z nsight-compute-2024. | 443.1 MB | ########7 | 88% 2025-05-07T20:26:25.9461465Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:26:25.9762646Z nsight-compute-2024. 
| 443.1 MB | ########9 | 90% 2025-05-07T20:26:25.9763023Z 2025-05-07T20:26:25.9763029Z 2025-05-07T20:26:25.9763034Z 2025-05-07T20:26:25.9763048Z 2025-05-07T20:26:25.9763053Z 2025-05-07T20:26:25.9763058Z 2025-05-07T20:26:26.0182836Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:26:26.0183238Z 2025-05-07T20:26:26.0183244Z 2025-05-07T20:26:26.0183258Z 2025-05-07T20:26:26.0183264Z 2025-05-07T20:26:26.0183269Z 2025-05-07T20:26:26.0183274Z 2025-05-07T20:26:26.0183278Z 2025-05-07T20:26:26.0183308Z 2025-05-07T20:26:26.0522427Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:26:26.1186595Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:26:26.1186955Z 2025-05-07T20:26:26.1186961Z 2025-05-07T20:26:26.1186966Z 2025-05-07T20:26:26.1186971Z 2025-05-07T20:26:26.1186984Z 2025-05-07T20:26:26.1186990Z 2025-05-07T20:26:26.1186995Z 2025-05-07T20:26:26.1187000Z 2025-05-07T20:26:26.1766784Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:26:26.2241225Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:26:26.2241604Z 2025-05-07T20:26:26.2241610Z 2025-05-07T20:26:26.2241615Z 2025-05-07T20:26:26.2241621Z 2025-05-07T20:26:26.2241626Z 2025-05-07T20:26:26.2241631Z 2025-05-07T20:26:26.2241652Z 2025-05-07T20:26:26.2246620Z 2025-05-07T20:26:26.2966001Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 14%  2025-05-07T20:26:26.3318821Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:26:26.3319501Z 2025-05-07T20:26:26.3319507Z 2025-05-07T20:26:26.3319512Z 2025-05-07T20:26:26.3319517Z 2025-05-07T20:26:26.3319522Z 2025-05-07T20:26:26.3319527Z 2025-05-07T20:26:26.3319532Z 2025-05-07T20:26:26.3321631Z 2025-05-07T20:26:26.4247632Z cuda-nvdisasm-12.6.7 | 47.6 MB | ## | 21%  2025-05-07T20:26:26.4320279Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:26:26.4320650Z 2025-05-07T20:26:26.4320790Z 2025-05-07T20:26:26.4320796Z 2025-05-07T20:26:26.4320805Z 2025-05-07T20:26:26.4320810Z 2025-05-07T20:26:26.4320815Z 2025-05-07T20:26:26.4320820Z 2025-05-07T20:26:26.4324869Z 2025-05-07T20:26:26.5321344Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##7 | 27%  2025-05-07T20:26:26.5321781Z 2025-05-07T20:26:26.5321794Z 2025-05-07T20:26:26.5321798Z 2025-05-07T20:26:26.5321802Z 2025-05-07T20:26:26.5321805Z 2025-05-07T20:26:26.5321809Z 2025-05-07T20:26:26.5321814Z 2025-05-07T20:26:26.5321842Z 2025-05-07T20:26:26.5381950Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###3 | 34%  2025-05-07T20:26:26.6450896Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:26:26.6451327Z 2025-05-07T20:26:26.6451333Z 2025-05-07T20:26:26.6451338Z 2025-05-07T20:26:26.6451343Z 2025-05-07T20:26:26.6451360Z 2025-05-07T20:26:26.6451366Z 2025-05-07T20:26:26.6451371Z 2025-05-07T20:26:26.6451416Z 2025-05-07T20:26:26.6605450Z cuda-nvdisasm-12.6.7 | 47.6 MB | #### | 41%  2025-05-07T20:26:26.7477165Z nsight-compute-2024. | 443.1 MB | #########4 | 95% 2025-05-07T20:26:26.7477530Z 2025-05-07T20:26:26.7477536Z 2025-05-07T20:26:26.7477541Z 2025-05-07T20:26:26.7477546Z 2025-05-07T20:26:26.7477551Z 2025-05-07T20:26:26.7477557Z 2025-05-07T20:26:26.7477562Z 2025-05-07T20:26:26.7479316Z 2025-05-07T20:26:26.7695193Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####7 | 47%  2025-05-07T20:26:26.8483159Z nsight-compute-2024. 
| 443.1 MB | #########5 | 96% 2025-05-07T20:26:26.8483571Z 2025-05-07T20:26:26.8483577Z 2025-05-07T20:26:26.8483582Z 2025-05-07T20:26:26.8483587Z 2025-05-07T20:26:26.8483592Z 2025-05-07T20:26:26.8483597Z 2025-05-07T20:26:26.8483602Z 2025-05-07T20:26:26.8486118Z 2025-05-07T20:26:26.8744717Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####3 | 54%  2025-05-07T20:26:26.8895498Z nsight-compute-2024. | 443.1 MB | #########6 | 96% 2025-05-07T20:26:26.8895852Z 2025-05-07T20:26:26.8895867Z 2025-05-07T20:26:26.8895873Z 2025-05-07T20:26:26.8895878Z 2025-05-07T20:26:26.8898689Z 2025-05-07T20:26:26.9240655Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:26:26.9241086Z 2025-05-07T20:26:26.9241093Z 2025-05-07T20:26:26.9241099Z 2025-05-07T20:26:26.9241105Z 2025-05-07T20:26:26.9241113Z 2025-05-07T20:26:26.9241119Z 2025-05-07T20:26:26.9241126Z 2025-05-07T20:26:26.9241132Z 2025-05-07T20:26:26.9241137Z 2025-05-07T20:26:26.9484180Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:26:26.9484608Z 2025-05-07T20:26:26.9484613Z 2025-05-07T20:26:26.9484618Z 2025-05-07T20:26:26.9484629Z 2025-05-07T20:26:26.9484634Z 2025-05-07T20:26:26.9484639Z 2025-05-07T20:26:26.9484644Z 2025-05-07T20:26:26.9487123Z 2025-05-07T20:26:26.9799963Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###### | 60%  2025-05-07T20:26:27.0240585Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:26:27.0240961Z 2025-05-07T20:26:27.0240967Z 2025-05-07T20:26:27.0240972Z 2025-05-07T20:26:27.0240978Z 2025-05-07T20:26:27.0240985Z 2025-05-07T20:26:27.0240991Z 2025-05-07T20:26:27.0240997Z 2025-05-07T20:26:27.0241004Z 2025-05-07T20:26:27.0243872Z 2025-05-07T20:26:27.0602343Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:26:27.0602758Z 2025-05-07T20:26:27.0602764Z 2025-05-07T20:26:27.0602769Z 2025-05-07T20:26:27.0602773Z 2025-05-07T20:26:27.0603086Z 2025-05-07T20:26:27.0603091Z 2025-05-07T20:26:27.0603110Z 2025-05-07T20:26:27.0603118Z 2025-05-07T20:26:27.0834049Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######6 | 67%  2025-05-07T20:26:27.1241461Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:26:27.1241865Z 2025-05-07T20:26:27.1241871Z 2025-05-07T20:26:27.1241876Z 2025-05-07T20:26:27.1241881Z 2025-05-07T20:26:27.1241900Z 2025-05-07T20:26:27.1241906Z 2025-05-07T20:26:27.1241911Z 2025-05-07T20:26:27.1241916Z 2025-05-07T20:26:27.1241921Z 2025-05-07T20:26:27.1667944Z libcurand-10.3.7.77 | 39.9 MB | #4 | 14%  2025-05-07T20:26:27.1668370Z 2025-05-07T20:26:27.1668376Z 2025-05-07T20:26:27.1668381Z 2025-05-07T20:26:27.1668388Z 2025-05-07T20:26:27.1668393Z 2025-05-07T20:26:27.1668398Z 2025-05-07T20:26:27.1668404Z 2025-05-07T20:26:27.1668409Z 2025-05-07T20:26:27.2242312Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######2 | 73%  2025-05-07T20:26:27.2242786Z 2025-05-07T20:26:27.2242790Z 2025-05-07T20:26:27.2242794Z 2025-05-07T20:26:27.2242797Z 2025-05-07T20:26:27.2242800Z 2025-05-07T20:26:27.2242804Z 2025-05-07T20:26:27.2242807Z 2025-05-07T20:26:27.2242822Z 2025-05-07T20:26:27.2242825Z 2025-05-07T20:26:27.2672438Z libcurand-10.3.7.77 | 39.9 MB | ##2 | 23%  2025-05-07T20:26:27.2672858Z 2025-05-07T20:26:27.2672886Z 2025-05-07T20:26:27.2672892Z 2025-05-07T20:26:27.2672897Z 2025-05-07T20:26:27.2672902Z 2025-05-07T20:26:27.2672924Z 2025-05-07T20:26:27.2672929Z 2025-05-07T20:26:27.2672934Z 2025-05-07T20:26:27.2823272Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######9 | 80%  2025-05-07T20:26:27.3303799Z nsight-compute-2024. 
| 443.1 MB | #########8 | 98% 2025-05-07T20:26:27.3304088Z 2025-05-07T20:26:27.3304096Z 2025-05-07T20:26:27.3304103Z 2025-05-07T20:26:27.3304108Z 2025-05-07T20:26:27.3304113Z 2025-05-07T20:26:27.3304119Z 2025-05-07T20:26:27.3304125Z 2025-05-07T20:26:27.3304161Z 2025-05-07T20:26:27.3306242Z 2025-05-07T20:26:27.3712145Z libcurand-10.3.7.77 | 39.9 MB | ### | 30%  2025-05-07T20:26:27.3712473Z 2025-05-07T20:26:27.3712478Z 2025-05-07T20:26:27.3712483Z 2025-05-07T20:26:27.3712505Z 2025-05-07T20:26:27.3712511Z 2025-05-07T20:26:27.3712516Z 2025-05-07T20:26:27.3712522Z 2025-05-07T20:26:27.3714854Z 2025-05-07T20:26:27.3895888Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########6 | 86%  2025-05-07T20:26:27.4464583Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:26:27.4464907Z 2025-05-07T20:26:27.4464911Z 2025-05-07T20:26:27.4464915Z 2025-05-07T20:26:27.4464920Z 2025-05-07T20:26:27.4464932Z 2025-05-07T20:26:27.4464936Z 2025-05-07T20:26:27.4464940Z 2025-05-07T20:26:27.4464943Z 2025-05-07T20:26:27.4464947Z 2025-05-07T20:26:27.4752922Z libcurand-10.3.7.77 | 39.9 MB | ###7 | 38%  2025-05-07T20:26:27.4753293Z 2025-05-07T20:26:27.4753339Z 2025-05-07T20:26:27.4753344Z 2025-05-07T20:26:27.4753349Z 2025-05-07T20:26:27.4753354Z 2025-05-07T20:26:27.4753360Z 2025-05-07T20:26:27.4753365Z 2025-05-07T20:26:27.4753372Z 2025-05-07T20:26:27.4898900Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########2 | 93%  2025-05-07T20:26:27.5470363Z nsight-compute-2024. | 443.1 MB | #########9 | 99% 2025-05-07T20:26:27.5470646Z 2025-05-07T20:26:27.5470942Z 2025-05-07T20:26:27.5470965Z 2025-05-07T20:26:27.5470970Z 2025-05-07T20:26:27.5470975Z 2025-05-07T20:26:27.5470980Z 2025-05-07T20:26:27.5470985Z 2025-05-07T20:26:27.5470990Z 2025-05-07T20:26:27.5470995Z 2025-05-07T20:26:27.5764894Z libcurand-10.3.7.77 | 39.9 MB | ####4 | 45%  2025-05-07T20:26:27.5765324Z 2025-05-07T20:26:27.5765329Z 2025-05-07T20:26:27.5765332Z 2025-05-07T20:26:27.5765336Z 2025-05-07T20:26:27.5765339Z 2025-05-07T20:26:27.5765343Z 2025-05-07T20:26:27.5765346Z 2025-05-07T20:26:27.5765350Z 2025-05-07T20:26:27.5909847Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########8 | 99%  2025-05-07T20:26:27.6472760Z nsight-compute-2024. 
| 443.1 MB | #########9 | 100% 2025-05-07T20:26:27.6473064Z 2025-05-07T20:26:27.6473068Z 2025-05-07T20:26:27.6473072Z 2025-05-07T20:26:27.6473076Z 2025-05-07T20:26:27.6473080Z 2025-05-07T20:26:27.6473083Z 2025-05-07T20:26:27.6473087Z 2025-05-07T20:26:27.6473090Z 2025-05-07T20:26:27.6473790Z 2025-05-07T20:26:27.7476179Z libcurand-10.3.7.77 | 39.9 MB | #####2 | 53%  2025-05-07T20:26:27.7476490Z 2025-05-07T20:26:27.7476493Z 2025-05-07T20:26:27.7476497Z 2025-05-07T20:26:27.7476500Z 2025-05-07T20:26:27.7476504Z 2025-05-07T20:26:27.7476507Z 2025-05-07T20:26:27.7476511Z 2025-05-07T20:26:27.7476515Z 2025-05-07T20:26:27.7476675Z 2025-05-07T20:26:27.8481106Z libcurand-10.3.7.77 | 39.9 MB | ###### | 61%  2025-05-07T20:26:27.8481429Z 2025-05-07T20:26:27.8481433Z 2025-05-07T20:26:27.8481436Z 2025-05-07T20:26:27.8481473Z 2025-05-07T20:26:27.8481476Z 2025-05-07T20:26:27.8481480Z 2025-05-07T20:26:27.8481483Z 2025-05-07T20:26:27.8481487Z 2025-05-07T20:26:27.8481945Z 2025-05-07T20:26:27.9482367Z libcurand-10.3.7.77 | 39.9 MB | ######9 | 70%  2025-05-07T20:26:27.9482706Z 2025-05-07T20:26:27.9482713Z 2025-05-07T20:26:27.9482717Z 2025-05-07T20:26:27.9482721Z 2025-05-07T20:26:27.9482733Z 2025-05-07T20:26:27.9482771Z 2025-05-07T20:26:27.9482775Z 2025-05-07T20:26:27.9482780Z 2025-05-07T20:26:27.9484213Z 2025-05-07T20:26:28.0491624Z libcurand-10.3.7.77 | 39.9 MB | #######8 | 79%  2025-05-07T20:26:28.0491955Z 2025-05-07T20:26:28.0491960Z 2025-05-07T20:26:28.0491963Z 2025-05-07T20:26:28.0491967Z 2025-05-07T20:26:28.0491971Z 2025-05-07T20:26:28.0491974Z 2025-05-07T20:26:28.0491977Z 2025-05-07T20:26:28.0491981Z 2025-05-07T20:26:28.0492649Z 2025-05-07T20:26:28.1496276Z libcurand-10.3.7.77 | 39.9 MB | ########7 | 87%  2025-05-07T20:26:28.1496729Z 2025-05-07T20:26:28.1496733Z 2025-05-07T20:26:28.1496745Z 2025-05-07T20:26:28.1496749Z 2025-05-07T20:26:28.1496752Z 2025-05-07T20:26:28.1496756Z 2025-05-07T20:26:28.1496759Z 2025-05-07T20:26:28.1496763Z 2025-05-07T20:26:28.1498757Z 2025-05-07T20:26:28.5061640Z libcurand-10.3.7.77 | 39.9 MB | #########7 | 97%  2025-05-07T20:26:28.5061988Z 2025-05-07T20:26:28.5062029Z 2025-05-07T20:26:28.5062033Z 2025-05-07T20:26:28.5062037Z 2025-05-07T20:26:28.5062040Z 2025-05-07T20:26:28.5062043Z 2025-05-07T20:26:28.5062047Z 2025-05-07T20:26:28.5346770Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%  2025-05-07T20:26:28.5347093Z 2025-05-07T20:26:28.5347099Z 2025-05-07T20:26:28.5347156Z 2025-05-07T20:26:28.5594488Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:28.5594784Z 2025-05-07T20:26:28.5594787Z 2025-05-07T20:26:28.5594791Z 2025-05-07T20:26:28.5594794Z 2025-05-07T20:26:28.5594822Z 2025-05-07T20:26:28.5594827Z 2025-05-07T20:26:28.5594830Z 2025-05-07T20:26:28.5594834Z 2025-05-07T20:26:28.5594837Z 2025-05-07T20:26:28.5596229Z 2025-05-07T20:26:28.6595995Z gds-tools-1.11.1.6 | 37.8 MB | | 0%  2025-05-07T20:26:28.6596329Z 2025-05-07T20:26:28.6596333Z 2025-05-07T20:26:28.6596337Z 2025-05-07T20:26:28.6596341Z 2025-05-07T20:26:28.6596598Z 2025-05-07T20:26:28.6596602Z 2025-05-07T20:26:28.6596606Z 2025-05-07T20:26:28.6596609Z 2025-05-07T20:26:28.6596613Z 2025-05-07T20:26:28.6596616Z 2025-05-07T20:26:28.7598338Z gds-tools-1.11.1.6 | 37.8 MB | 8 | 8%  2025-05-07T20:26:28.7598675Z 2025-05-07T20:26:28.7598679Z 2025-05-07T20:26:28.7598683Z 2025-05-07T20:26:28.7598687Z 2025-05-07T20:26:28.7598691Z 2025-05-07T20:26:28.7598695Z 2025-05-07T20:26:28.7598700Z 2025-05-07T20:26:28.7598703Z 2025-05-07T20:26:28.7598707Z 2025-05-07T20:26:28.7599685Z 2025-05-07T20:26:28.8602904Z 
gds-tools-1.11.1.6 | 37.8 MB | #6 | 17%  2025-05-07T20:26:28.8603245Z 2025-05-07T20:26:28.8603249Z 2025-05-07T20:26:28.8603253Z 2025-05-07T20:26:28.8603256Z 2025-05-07T20:26:28.8603260Z 2025-05-07T20:26:28.8603263Z 2025-05-07T20:26:28.8603267Z 2025-05-07T20:26:28.8603271Z 2025-05-07T20:26:28.8603274Z 2025-05-07T20:26:28.8603551Z 2025-05-07T20:26:28.9607726Z gds-tools-1.11.1.6 | 37.8 MB | ##5 | 25%  2025-05-07T20:26:28.9608062Z 2025-05-07T20:26:28.9608066Z 2025-05-07T20:26:28.9608070Z 2025-05-07T20:26:28.9608075Z 2025-05-07T20:26:28.9608078Z 2025-05-07T20:26:28.9608082Z 2025-05-07T20:26:28.9608085Z 2025-05-07T20:26:28.9608089Z 2025-05-07T20:26:28.9608092Z 2025-05-07T20:26:28.9610130Z 2025-05-07T20:26:29.0609386Z gds-tools-1.11.1.6 | 37.8 MB | ###4 | 35%  2025-05-07T20:26:29.0609714Z 2025-05-07T20:26:29.0609718Z 2025-05-07T20:26:29.0609729Z 2025-05-07T20:26:29.0609765Z 2025-05-07T20:26:29.0609770Z 2025-05-07T20:26:29.0609775Z 2025-05-07T20:26:29.0609780Z 2025-05-07T20:26:29.0609787Z 2025-05-07T20:26:29.0609792Z 2025-05-07T20:26:29.0613352Z 2025-05-07T20:26:29.1612365Z gds-tools-1.11.1.6 | 37.8 MB | ####4 | 44%  2025-05-07T20:26:29.1612702Z 2025-05-07T20:26:29.1612706Z 2025-05-07T20:26:29.1612710Z 2025-05-07T20:26:29.1612713Z 2025-05-07T20:26:29.1612752Z 2025-05-07T20:26:29.1612755Z 2025-05-07T20:26:29.1612759Z 2025-05-07T20:26:29.1612763Z 2025-05-07T20:26:29.1612767Z 2025-05-07T20:26:29.1612770Z 2025-05-07T20:26:29.2027857Z gds-tools-1.11.1.6 | 37.8 MB | #####4 | 54%  2025-05-07T20:26:29.2028360Z 2025-05-07T20:26:29.2090354Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%  2025-05-07T20:26:29.2090630Z 2025-05-07T20:26:29.2090634Z 2025-05-07T20:26:29.2090638Z 2025-05-07T20:26:29.2090641Z 2025-05-07T20:26:29.2090645Z 2025-05-07T20:26:29.2090649Z 2025-05-07T20:26:29.2090682Z 2025-05-07T20:26:29.2092455Z 2025-05-07T20:26:29.2507589Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%  2025-05-07T20:26:29.2507905Z 2025-05-07T20:26:29.2507909Z 2025-05-07T20:26:29.2507922Z 2025-05-07T20:26:29.2507926Z 2025-05-07T20:26:29.2507929Z 2025-05-07T20:26:29.2507937Z 2025-05-07T20:26:29.2507940Z 2025-05-07T20:26:29.2507944Z 2025-05-07T20:26:29.2507948Z 2025-05-07T20:26:29.2507969Z 2025-05-07T20:26:29.2507973Z 2025-05-07T20:26:29.2508849Z 2025-05-07T20:26:29.2612437Z cuda-nvrtc-12.6.85 | 17.3 MB | | 0%  2025-05-07T20:26:29.2612837Z 2025-05-07T20:26:29.2612841Z 2025-05-07T20:26:29.2612845Z 2025-05-07T20:26:29.2612848Z 2025-05-07T20:26:29.2612852Z 2025-05-07T20:26:29.2612855Z 2025-05-07T20:26:29.2612859Z 2025-05-07T20:26:29.2612862Z 2025-05-07T20:26:29.2612866Z 2025-05-07T20:26:29.2612869Z 2025-05-07T20:26:29.2623029Z gds-tools-1.11.1.6 | 37.8 MB | ######4 | 64%  2025-05-07T20:26:29.2623348Z 2025-05-07T20:26:29.2623352Z 2025-05-07T20:26:29.2623356Z 2025-05-07T20:26:29.2623359Z 2025-05-07T20:26:29.2623363Z 2025-05-07T20:26:29.2623366Z 2025-05-07T20:26:29.2623370Z 2025-05-07T20:26:29.2623373Z 2025-05-07T20:26:29.2623377Z 2025-05-07T20:26:29.2623380Z 2025-05-07T20:26:29.2625973Z 2025-05-07T20:26:29.3512801Z cuda-nvcc-tools-12.6 | 23.0 MB | | 0%  2025-05-07T20:26:29.3513171Z 2025-05-07T20:26:29.3513175Z 2025-05-07T20:26:29.3513179Z 2025-05-07T20:26:29.3513184Z 2025-05-07T20:26:29.3513187Z 2025-05-07T20:26:29.3513191Z 2025-05-07T20:26:29.3513194Z 2025-05-07T20:26:29.3513198Z 2025-05-07T20:26:29.3513201Z 2025-05-07T20:26:29.3513205Z 2025-05-07T20:26:29.3513208Z 2025-05-07T20:26:29.3514678Z 2025-05-07T20:26:29.3620576Z cuda-nvrtc-12.6.85 | 17.3 MB | #6 | 16%  2025-05-07T20:26:29.3620957Z 2025-05-07T20:26:29.3620961Z 
2025-05-07T20:26:29.3621236Z 2025-05-07T20:26:29.3621240Z 2025-05-07T20:26:29.3621243Z 2025-05-07T20:26:29.3621247Z 2025-05-07T20:26:29.3621250Z 2025-05-07T20:26:29.3621254Z 2025-05-07T20:26:29.3621257Z 2025-05-07T20:26:29.3621261Z 2025-05-07T20:26:29.3626340Z 2025-05-07T20:26:29.3879947Z cuda-nvcc-tools-12.6 | 23.0 MB | #1 | 12%  2025-05-07T20:26:29.3880343Z 2025-05-07T20:26:29.3880347Z 2025-05-07T20:26:29.3880368Z 2025-05-07T20:26:29.3880372Z 2025-05-07T20:26:29.3880376Z 2025-05-07T20:26:29.3880390Z 2025-05-07T20:26:29.3880393Z 2025-05-07T20:26:29.3880397Z 2025-05-07T20:26:29.3880400Z 2025-05-07T20:26:29.3880404Z 2025-05-07T20:26:29.4515601Z gds-tools-1.11.1.6 | 37.8 MB | #######3 | 74%  2025-05-07T20:26:29.4515935Z 2025-05-07T20:26:29.4515939Z 2025-05-07T20:26:29.4515942Z 2025-05-07T20:26:29.4515946Z 2025-05-07T20:26:29.4515951Z 2025-05-07T20:26:29.4515956Z 2025-05-07T20:26:29.4515959Z 2025-05-07T20:26:29.4515993Z 2025-05-07T20:26:29.4515997Z 2025-05-07T20:26:29.4516001Z 2025-05-07T20:26:29.4516004Z 2025-05-07T20:26:29.4519054Z 2025-05-07T20:26:29.4634707Z cuda-nvrtc-12.6.85 | 17.3 MB | ###3 | 34%  2025-05-07T20:26:29.4635036Z 2025-05-07T20:26:29.4635040Z 2025-05-07T20:26:29.4635043Z 2025-05-07T20:26:29.4635047Z 2025-05-07T20:26:29.4635050Z 2025-05-07T20:26:29.4635054Z 2025-05-07T20:26:29.4635074Z 2025-05-07T20:26:29.4635077Z 2025-05-07T20:26:29.4635081Z 2025-05-07T20:26:29.4635084Z 2025-05-07T20:26:29.4635088Z 2025-05-07T20:26:29.5040584Z cuda-nvcc-tools-12.6 | 23.0 MB | ##3 | 23%  2025-05-07T20:26:29.5040920Z 2025-05-07T20:26:29.5040924Z 2025-05-07T20:26:29.5040927Z 2025-05-07T20:26:29.5040931Z 2025-05-07T20:26:29.5040935Z 2025-05-07T20:26:29.5040938Z 2025-05-07T20:26:29.5040942Z 2025-05-07T20:26:29.5040952Z 2025-05-07T20:26:29.5040956Z 2025-05-07T20:26:29.5098820Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%  2025-05-07T20:26:29.5099152Z 2025-05-07T20:26:29.5099157Z 2025-05-07T20:26:29.5099165Z 2025-05-07T20:26:29.5099169Z 2025-05-07T20:26:29.5099172Z 2025-05-07T20:26:29.5099176Z 2025-05-07T20:26:29.5099180Z 2025-05-07T20:26:29.5099185Z 2025-05-07T20:26:29.5099188Z 2025-05-07T20:26:29.5100972Z 2025-05-07T20:26:29.5531958Z gds-tools-1.11.1.6 | 37.8 MB | ########2 | 83%  2025-05-07T20:26:29.5532336Z 2025-05-07T20:26:29.5532340Z 2025-05-07T20:26:29.5532344Z 2025-05-07T20:26:29.5532347Z 2025-05-07T20:26:29.5532351Z 2025-05-07T20:26:29.5532354Z 2025-05-07T20:26:29.5532358Z 2025-05-07T20:26:29.5532362Z 2025-05-07T20:26:29.5532365Z 2025-05-07T20:26:29.5532369Z 2025-05-07T20:26:29.5532384Z 2025-05-07T20:26:29.5532388Z 2025-05-07T20:26:29.5532391Z 2025-05-07T20:26:29.5634706Z libnvjitlink-12.6.85 | 14.9 MB | | 0%  2025-05-07T20:26:29.5635048Z 2025-05-07T20:26:29.5635083Z 2025-05-07T20:26:29.5635087Z 2025-05-07T20:26:29.5635090Z 2025-05-07T20:26:29.5635093Z 2025-05-07T20:26:29.5635097Z 2025-05-07T20:26:29.5635100Z 2025-05-07T20:26:29.5635104Z 2025-05-07T20:26:29.5635107Z 2025-05-07T20:26:29.5635111Z 2025-05-07T20:26:29.5635114Z 2025-05-07T20:26:29.5703139Z cuda-nvcc-tools-12.6 | 23.0 MB | ###6 | 36%  2025-05-07T20:26:29.5703476Z 2025-05-07T20:26:29.5703713Z 2025-05-07T20:26:29.5703718Z 2025-05-07T20:26:29.5703721Z 2025-05-07T20:26:29.5703725Z 2025-05-07T20:26:29.5703733Z 2025-05-07T20:26:29.5703739Z 2025-05-07T20:26:29.5703744Z 2025-05-07T20:26:29.5703747Z 2025-05-07T20:26:29.5703751Z 2025-05-07T20:26:29.5703754Z 2025-05-07T20:26:29.5711597Z 2025-05-07T20:26:29.6251607Z cuda-nvrtc-12.6.85 | 17.3 MB | ##### | 50%  2025-05-07T20:26:29.6251935Z 2025-05-07T20:26:29.6251939Z 
2025-05-07T20:26:29.6251942Z 2025-05-07T20:26:29.6251946Z 2025-05-07T20:26:29.6251951Z 2025-05-07T20:26:29.6252211Z 2025-05-07T20:26:29.6252215Z 2025-05-07T20:26:29.6252218Z 2025-05-07T20:26:29.6252223Z 2025-05-07T20:26:29.6252237Z 2025-05-07T20:26:29.6528762Z gds-tools-1.11.1.6 | 37.8 MB | #########1 | 91%  2025-05-07T20:26:29.6529147Z 2025-05-07T20:26:29.6529152Z 2025-05-07T20:26:29.6529156Z 2025-05-07T20:26:29.6529169Z 2025-05-07T20:26:29.6529173Z 2025-05-07T20:26:29.6529177Z 2025-05-07T20:26:29.6529194Z 2025-05-07T20:26:29.6529198Z 2025-05-07T20:26:29.6529201Z 2025-05-07T20:26:29.6529205Z 2025-05-07T20:26:29.6529208Z 2025-05-07T20:26:29.6529212Z 2025-05-07T20:26:29.6529215Z 2025-05-07T20:26:29.6805178Z libnvjitlink-12.6.85 | 14.9 MB | #7 | 18%  2025-05-07T20:26:29.6805520Z 2025-05-07T20:26:29.6805524Z 2025-05-07T20:26:29.6805528Z 2025-05-07T20:26:29.6805531Z 2025-05-07T20:26:29.6805535Z 2025-05-07T20:26:29.6805538Z 2025-05-07T20:26:29.6805542Z 2025-05-07T20:26:29.6805545Z 2025-05-07T20:26:29.6805568Z 2025-05-07T20:26:29.6805571Z 2025-05-07T20:26:29.6809094Z 2025-05-07T20:26:29.6840705Z cuda-nvcc-tools-12.6 | 23.0 MB | ####8 | 48%  2025-05-07T20:26:29.6841354Z 2025-05-07T20:26:29.6841358Z 2025-05-07T20:26:29.6841361Z 2025-05-07T20:26:29.6841365Z 2025-05-07T20:26:29.6841368Z 2025-05-07T20:26:29.6841371Z 2025-05-07T20:26:29.6841375Z 2025-05-07T20:26:29.6841378Z 2025-05-07T20:26:29.6841392Z 2025-05-07T20:26:29.6841396Z 2025-05-07T20:26:29.6841451Z 2025-05-07T20:26:29.6841455Z 2025-05-07T20:26:29.7430736Z cuda-nvrtc-12.6.85 | 17.3 MB | ######6 | 66%  2025-05-07T20:26:29.7431171Z 2025-05-07T20:26:29.7431177Z 2025-05-07T20:26:29.7431182Z 2025-05-07T20:26:29.7431187Z 2025-05-07T20:26:29.7431192Z 2025-05-07T20:26:29.7431197Z 2025-05-07T20:26:29.7431204Z 2025-05-07T20:26:29.7431209Z 2025-05-07T20:26:29.7431215Z 2025-05-07T20:26:29.7431221Z 2025-05-07T20:26:29.7566700Z gds-tools-1.11.1.6 | 37.8 MB | #########9 | 100%  2025-05-07T20:26:29.7567192Z 2025-05-07T20:26:29.7567198Z 2025-05-07T20:26:29.7567203Z 2025-05-07T20:26:29.7567209Z 2025-05-07T20:26:29.7567215Z 2025-05-07T20:26:29.7567220Z 2025-05-07T20:26:29.7567227Z 2025-05-07T20:26:29.7567233Z 2025-05-07T20:26:29.7567238Z 2025-05-07T20:26:29.7567245Z 2025-05-07T20:26:29.7567251Z 2025-05-07T20:26:29.7567257Z 2025-05-07T20:26:29.7567279Z 2025-05-07T20:26:29.7893095Z libnvjitlink-12.6.85 | 14.9 MB | ###5 | 36%  2025-05-07T20:26:29.7893435Z 2025-05-07T20:26:29.7893439Z 2025-05-07T20:26:29.7893442Z 2025-05-07T20:26:29.7893446Z 2025-05-07T20:26:29.7893449Z 2025-05-07T20:26:29.7893460Z 2025-05-07T20:26:29.7893464Z 2025-05-07T20:26:29.7893467Z 2025-05-07T20:26:29.7893471Z 2025-05-07T20:26:29.7893474Z 2025-05-07T20:26:29.7893478Z 2025-05-07T20:26:29.7896960Z 2025-05-07T20:26:29.7979760Z cuda-nvrtc-12.6.85 | 17.3 MB | ######## | 81%  2025-05-07T20:26:29.7980252Z 2025-05-07T20:26:29.7980258Z 2025-05-07T20:26:29.7980264Z 2025-05-07T20:26:29.7980269Z 2025-05-07T20:26:29.7980274Z 2025-05-07T20:26:29.7980280Z 2025-05-07T20:26:29.7980285Z 2025-05-07T20:26:29.7980290Z 2025-05-07T20:26:29.7980295Z 2025-05-07T20:26:29.7980300Z 2025-05-07T20:26:29.7983618Z 2025-05-07T20:26:29.8569057Z cuda-nvcc-tools-12.6 | 23.0 MB | #####9 | 60%  2025-05-07T20:26:29.8569405Z 2025-05-07T20:26:29.8569409Z 2025-05-07T20:26:29.8569413Z 2025-05-07T20:26:29.8569416Z 2025-05-07T20:26:29.8569430Z 2025-05-07T20:26:29.8569434Z 2025-05-07T20:26:29.8569438Z 2025-05-07T20:26:29.8569441Z 2025-05-07T20:26:29.8569445Z 2025-05-07T20:26:29.8569448Z 2025-05-07T20:26:29.8569452Z 
2025-05-07T20:26:29.8569455Z 2025-05-07T20:26:29.8574110Z 2025-05-07T20:26:29.8900104Z libnvjitlink-12.6.85 | 14.9 MB | #####4 | 55%  2025-05-07T20:26:29.8900444Z 2025-05-07T20:26:29.8900708Z 2025-05-07T20:26:29.8900711Z 2025-05-07T20:26:29.8900715Z 2025-05-07T20:26:29.8900719Z 2025-05-07T20:26:29.8900722Z 2025-05-07T20:26:29.8900725Z 2025-05-07T20:26:29.8900729Z 2025-05-07T20:26:29.8900732Z 2025-05-07T20:26:29.8900736Z 2025-05-07T20:26:29.8900739Z 2025-05-07T20:26:29.8903761Z 2025-05-07T20:26:29.8983730Z cuda-nvrtc-12.6.85 | 17.3 MB | #########6 | 97%  2025-05-07T20:26:29.8984062Z 2025-05-07T20:26:29.8984066Z 2025-05-07T20:26:29.8984070Z 2025-05-07T20:26:29.8984073Z 2025-05-07T20:26:29.8984077Z 2025-05-07T20:26:29.8984080Z 2025-05-07T20:26:29.8984084Z 2025-05-07T20:26:29.8984087Z 2025-05-07T20:26:29.8984091Z 2025-05-07T20:26:29.8984095Z 2025-05-07T20:26:29.8986007Z 2025-05-07T20:26:29.9573937Z cuda-nvcc-tools-12.6 | 23.0 MB | ####### | 71%  2025-05-07T20:26:29.9574434Z 2025-05-07T20:26:29.9574442Z 2025-05-07T20:26:29.9574448Z 2025-05-07T20:26:29.9574454Z 2025-05-07T20:26:29.9574482Z 2025-05-07T20:26:29.9574488Z 2025-05-07T20:26:29.9574493Z 2025-05-07T20:26:29.9574610Z 2025-05-07T20:26:29.9574618Z 2025-05-07T20:26:29.9574624Z 2025-05-07T20:26:29.9574631Z 2025-05-07T20:26:29.9574637Z 2025-05-07T20:26:29.9574656Z 2025-05-07T20:26:29.9995881Z libnvjitlink-12.6.85 | 14.9 MB | #######3 | 74%  2025-05-07T20:26:29.9996218Z 2025-05-07T20:26:29.9996229Z 2025-05-07T20:26:29.9996252Z 2025-05-07T20:26:29.9996264Z 2025-05-07T20:26:29.9996268Z 2025-05-07T20:26:29.9996272Z 2025-05-07T20:26:29.9996275Z 2025-05-07T20:26:29.9996279Z 2025-05-07T20:26:29.9996282Z 2025-05-07T20:26:29.9996286Z 2025-05-07T20:26:29.9999644Z 2025-05-07T20:26:30.0590654Z cuda-nvcc-tools-12.6 | 23.0 MB | ########2 | 82%  2025-05-07T20:26:30.0591114Z 2025-05-07T20:26:30.0591120Z 2025-05-07T20:26:30.0591126Z 2025-05-07T20:26:30.0591139Z 2025-05-07T20:26:30.0591145Z 2025-05-07T20:26:30.0591150Z 2025-05-07T20:26:30.0591182Z 2025-05-07T20:26:30.0591187Z 2025-05-07T20:26:30.0591193Z 2025-05-07T20:26:30.0591198Z 2025-05-07T20:26:30.0591204Z 2025-05-07T20:26:30.0591209Z 2025-05-07T20:26:30.0591220Z 2025-05-07T20:26:30.0713738Z libnvjitlink-12.6.85 | 14.9 MB | #########2 | 93%  2025-05-07T20:26:30.0714081Z 2025-05-07T20:26:30.0715192Z 2025-05-07T20:26:30.1007848Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:26:30.1008140Z 2025-05-07T20:26:30.1008144Z 2025-05-07T20:26:30.1008148Z 2025-05-07T20:26:30.1008151Z 2025-05-07T20:26:30.1008155Z 2025-05-07T20:26:30.1008159Z 2025-05-07T20:26:30.1008162Z 2025-05-07T20:26:30.1008166Z 2025-05-07T20:26:30.1008169Z 2025-05-07T20:26:30.1008173Z 2025-05-07T20:26:30.1008176Z 2025-05-07T20:26:30.5390397Z cuda-nvcc-tools-12.6 | 23.0 MB | #########3 | 93%  2025-05-07T20:26:30.5390744Z 2025-05-07T20:26:30.5390748Z 2025-05-07T20:26:30.5390752Z 2025-05-07T20:26:30.5390755Z 2025-05-07T20:26:30.5390800Z 2025-05-07T20:26:30.5390803Z 2025-05-07T20:26:30.5390807Z 2025-05-07T20:26:30.5390810Z 2025-05-07T20:26:30.5390813Z 2025-05-07T20:26:30.5390827Z 2025-05-07T20:26:30.5390831Z 2025-05-07T20:26:30.5391832Z 2025-05-07T20:26:30.5528071Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%  2025-05-07T20:26:30.5528490Z 2025-05-07T20:26:30.5528494Z 2025-05-07T20:26:30.5528773Z 2025-05-07T20:26:30.5528778Z 2025-05-07T20:26:30.5528781Z 2025-05-07T20:26:30.5528784Z 2025-05-07T20:26:30.5528788Z 2025-05-07T20:26:30.5528792Z 2025-05-07T20:26:30.5528795Z 2025-05-07T20:26:30.5528799Z 2025-05-07T20:26:30.5528803Z 
2025-05-07T20:26:30.5528806Z 2025-05-07T20:26:30.5530252Z 2025-05-07T20:26:30.5821485Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%  2025-05-07T20:26:30.5821925Z 2025-05-07T20:26:30.5821929Z 2025-05-07T20:26:30.5821933Z 2025-05-07T20:26:30.5821936Z 2025-05-07T20:26:30.5821940Z 2025-05-07T20:26:30.5822212Z 2025-05-07T20:26:30.5822216Z 2025-05-07T20:26:30.5822219Z 2025-05-07T20:26:30.5822223Z 2025-05-07T20:26:30.5822226Z 2025-05-07T20:26:30.5822229Z 2025-05-07T20:26:30.5822233Z 2025-05-07T20:26:30.5822245Z 2025-05-07T20:26:30.5822249Z 2025-05-07T20:26:30.6061057Z cuda-nvcc-dev_linux- | 10.8 MB | | 0%  2025-05-07T20:26:30.6061480Z 2025-05-07T20:26:30.6061498Z 2025-05-07T20:26:30.6061510Z 2025-05-07T20:26:30.6061514Z 2025-05-07T20:26:30.6061517Z 2025-05-07T20:26:30.6061521Z 2025-05-07T20:26:30.6061524Z 2025-05-07T20:26:30.6061528Z 2025-05-07T20:26:30.6061531Z 2025-05-07T20:26:30.6061535Z 2025-05-07T20:26:30.6061538Z 2025-05-07T20:26:30.6061542Z 2025-05-07T20:26:30.6061545Z 2025-05-07T20:26:30.6061549Z 2025-05-07T20:26:30.6061552Z 2025-05-07T20:26:30.6824346Z cuda-nvvm-tools-12.6 | 10.4 MB | | 0%  2025-05-07T20:26:30.6824699Z 2025-05-07T20:26:30.6824721Z 2025-05-07T20:26:30.6824724Z 2025-05-07T20:26:30.6824728Z 2025-05-07T20:26:30.6824731Z 2025-05-07T20:26:30.6824735Z 2025-05-07T20:26:30.6824738Z 2025-05-07T20:26:30.6824742Z 2025-05-07T20:26:30.6824745Z 2025-05-07T20:26:30.6824749Z 2025-05-07T20:26:30.6824752Z 2025-05-07T20:26:30.6824756Z 2025-05-07T20:26:30.6824759Z 2025-05-07T20:26:30.6826354Z 2025-05-07T20:26:30.7066683Z cuda-nvcc-dev_linux- | 10.8 MB | ###2 | 32%  2025-05-07T20:26:30.7067036Z 2025-05-07T20:26:30.7067040Z 2025-05-07T20:26:30.7067043Z 2025-05-07T20:26:30.7067055Z 2025-05-07T20:26:30.7067058Z 2025-05-07T20:26:30.7067062Z 2025-05-07T20:26:30.7067065Z 2025-05-07T20:26:30.7067068Z 2025-05-07T20:26:30.7067072Z 2025-05-07T20:26:30.7067075Z 2025-05-07T20:26:30.7067079Z 2025-05-07T20:26:30.7067082Z 2025-05-07T20:26:30.7067085Z 2025-05-07T20:26:30.7067089Z 2025-05-07T20:26:30.7067092Z 2025-05-07T20:26:30.7982875Z cuda-nvvm-tools-12.6 | 10.4 MB | ##5 | 25%  2025-05-07T20:26:30.7983246Z 2025-05-07T20:26:30.7983250Z 2025-05-07T20:26:30.7983254Z 2025-05-07T20:26:30.7983257Z 2025-05-07T20:26:30.7983261Z 2025-05-07T20:26:30.7983265Z 2025-05-07T20:26:30.7983269Z 2025-05-07T20:26:30.7983273Z 2025-05-07T20:26:30.7983276Z 2025-05-07T20:26:30.7983280Z 2025-05-07T20:26:30.7983283Z 2025-05-07T20:26:30.7983287Z 2025-05-07T20:26:30.7983290Z 2025-05-07T20:26:30.7986782Z 2025-05-07T20:26:30.8067759Z cuda-nvcc-dev_linux- | 10.8 MB | ######4 | 65%  2025-05-07T20:26:30.8068175Z 2025-05-07T20:26:30.8068179Z 2025-05-07T20:26:30.8068183Z 2025-05-07T20:26:30.8068186Z 2025-05-07T20:26:30.8068190Z 2025-05-07T20:26:30.8068194Z 2025-05-07T20:26:30.8068197Z 2025-05-07T20:26:30.8068201Z 2025-05-07T20:26:30.8068204Z 2025-05-07T20:26:30.8068208Z 2025-05-07T20:26:30.8068211Z 2025-05-07T20:26:30.8068215Z 2025-05-07T20:26:30.8068225Z 2025-05-07T20:26:30.8068229Z 2025-05-07T20:26:30.8068241Z 2025-05-07T20:26:30.8815230Z cuda-nvvm-tools-12.6 | 10.4 MB | ##### | 51%  2025-05-07T20:26:30.8815661Z 2025-05-07T20:26:30.8815677Z 2025-05-07T20:26:30.8815682Z 2025-05-07T20:26:30.8815687Z 2025-05-07T20:26:30.8815692Z 2025-05-07T20:26:30.8815697Z 2025-05-07T20:26:30.8815702Z 2025-05-07T20:26:30.8815707Z 2025-05-07T20:26:30.8815713Z 2025-05-07T20:26:30.8815947Z 2025-05-07T20:26:30.8819137Z 2025-05-07T20:26:30.8996211Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%  2025-05-07T20:26:30.8996857Z 2025-05-07T20:26:30.8996866Z 
2025-05-07T20:26:30.8996873Z 2025-05-07T20:26:30.8996880Z 2025-05-07T20:26:30.8996887Z 2025-05-07T20:26:30.8996894Z 2025-05-07T20:26:30.8996901Z 2025-05-07T20:26:30.8996908Z 2025-05-07T20:26:30.8996915Z 2025-05-07T20:26:30.8996921Z 2025-05-07T20:26:30.8996928Z 2025-05-07T20:26:30.8996935Z 2025-05-07T20:26:30.8996942Z 2025-05-07T20:26:30.8996949Z 2025-05-07T20:26:30.9070755Z cuda-nvcc-dev_linux- | 10.8 MB | #########4 | 95%  2025-05-07T20:26:30.9071187Z 2025-05-07T20:26:30.9071191Z 2025-05-07T20:26:30.9071194Z 2025-05-07T20:26:30.9071198Z 2025-05-07T20:26:30.9071201Z 2025-05-07T20:26:30.9071204Z 2025-05-07T20:26:30.9071208Z 2025-05-07T20:26:30.9071219Z 2025-05-07T20:26:30.9071222Z 2025-05-07T20:26:30.9071226Z 2025-05-07T20:26:30.9071238Z 2025-05-07T20:26:30.9071242Z 2025-05-07T20:26:30.9071245Z 2025-05-07T20:26:30.9071248Z 2025-05-07T20:26:30.9071252Z 2025-05-07T20:26:30.9419186Z cuda-nvvm-tools-12.6 | 10.4 MB | ########2 | 83%  2025-05-07T20:26:30.9419546Z 2025-05-07T20:26:30.9419550Z 2025-05-07T20:26:30.9419553Z 2025-05-07T20:26:30.9419557Z 2025-05-07T20:26:30.9419561Z 2025-05-07T20:26:30.9419564Z 2025-05-07T20:26:30.9419568Z 2025-05-07T20:26:30.9419571Z 2025-05-07T20:26:30.9419575Z 2025-05-07T20:26:30.9419578Z 2025-05-07T20:26:30.9419582Z 2025-05-07T20:26:30.9419598Z 2025-05-07T20:26:30.9419602Z 2025-05-07T20:26:30.9419605Z 2025-05-07T20:26:30.9419616Z 2025-05-07T20:26:30.9422009Z 2025-05-07T20:26:30.9754567Z cuda-sanitizer-api-1 | 8.9 MB | | 0%  2025-05-07T20:26:30.9754921Z 2025-05-07T20:26:30.9754932Z 2025-05-07T20:26:30.9754936Z 2025-05-07T20:26:30.9754940Z 2025-05-07T20:26:30.9754943Z 2025-05-07T20:26:30.9754962Z 2025-05-07T20:26:30.9754966Z 2025-05-07T20:26:30.9754969Z 2025-05-07T20:26:30.9754973Z 2025-05-07T20:26:30.9754976Z 2025-05-07T20:26:31.0318398Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%  2025-05-07T20:26:31.0318804Z 2025-05-07T20:26:31.0318808Z 2025-05-07T20:26:31.0318812Z 2025-05-07T20:26:31.0318815Z 2025-05-07T20:26:31.0318819Z 2025-05-07T20:26:31.0318823Z 2025-05-07T20:26:31.0318826Z 2025-05-07T20:26:31.0318830Z 2025-05-07T20:26:31.0318833Z 2025-05-07T20:26:31.0318837Z 2025-05-07T20:26:31.0318840Z 2025-05-07T20:26:31.0318865Z 2025-05-07T20:26:31.0318868Z 2025-05-07T20:26:31.0318872Z 2025-05-07T20:26:31.0318875Z 2025-05-07T20:26:31.0318879Z 2025-05-07T20:26:31.0320195Z 2025-05-07T20:26:31.0422008Z cuda-nvvm-impl-12.6. | 7.7 MB | | 0%  2025-05-07T20:26:31.0422361Z 2025-05-07T20:26:31.0422365Z 2025-05-07T20:26:31.0422368Z 2025-05-07T20:26:31.0422373Z 2025-05-07T20:26:31.0422388Z 2025-05-07T20:26:31.0422392Z 2025-05-07T20:26:31.0422395Z 2025-05-07T20:26:31.0422399Z 2025-05-07T20:26:31.0422402Z 2025-05-07T20:26:31.0422405Z 2025-05-07T20:26:31.0422409Z 2025-05-07T20:26:31.0422412Z 2025-05-07T20:26:31.0422416Z 2025-05-07T20:26:31.0422419Z 2025-05-07T20:26:31.0422423Z 2025-05-07T20:26:31.0423926Z 2025-05-07T20:26:31.1321893Z cuda-sanitizer-api-1 | 8.9 MB | ###9 | 39%  2025-05-07T20:26:31.1322261Z 2025-05-07T20:26:31.1322265Z 2025-05-07T20:26:31.1322268Z 2025-05-07T20:26:31.1322307Z 2025-05-07T20:26:31.1322311Z 2025-05-07T20:26:31.1322314Z 2025-05-07T20:26:31.1322327Z 2025-05-07T20:26:31.1322331Z 2025-05-07T20:26:31.1322335Z 2025-05-07T20:26:31.1322339Z 2025-05-07T20:26:31.1322343Z 2025-05-07T20:26:31.1322347Z 2025-05-07T20:26:31.1322350Z 2025-05-07T20:26:31.1322354Z 2025-05-07T20:26:31.1322357Z 2025-05-07T20:26:31.1322361Z 2025-05-07T20:26:31.1325208Z 2025-05-07T20:26:31.1732275Z cuda-nvvm-impl-12.6. 
| 7.7 MB | ###6 | 36%  2025-05-07T20:26:31.1732664Z 2025-05-07T20:26:31.1732668Z 2025-05-07T20:26:31.1732671Z 2025-05-07T20:26:31.1732675Z 2025-05-07T20:26:31.1732678Z 2025-05-07T20:26:31.1732682Z 2025-05-07T20:26:31.1732685Z 2025-05-07T20:26:31.1732689Z 2025-05-07T20:26:31.1732692Z 2025-05-07T20:26:31.1732695Z 2025-05-07T20:26:31.1732699Z 2025-05-07T20:26:31.1732702Z 2025-05-07T20:26:31.1732706Z 2025-05-07T20:26:31.1732709Z 2025-05-07T20:26:31.1732713Z 2025-05-07T20:26:31.1734414Z 2025-05-07T20:26:31.2328281Z cuda-sanitizer-api-1 | 8.9 MB | #######8 | 78%  2025-05-07T20:26:31.2328643Z 2025-05-07T20:26:31.2328658Z 2025-05-07T20:26:31.2328662Z 2025-05-07T20:26:31.2328665Z 2025-05-07T20:26:31.2328669Z 2025-05-07T20:26:31.2328672Z 2025-05-07T20:26:31.2328676Z 2025-05-07T20:26:31.2328679Z 2025-05-07T20:26:31.2328683Z 2025-05-07T20:26:31.2328687Z 2025-05-07T20:26:31.2328718Z 2025-05-07T20:26:31.2328722Z 2025-05-07T20:26:31.2328725Z 2025-05-07T20:26:31.2328729Z 2025-05-07T20:26:31.2328732Z 2025-05-07T20:26:31.2328736Z 2025-05-07T20:26:31.2328739Z 2025-05-07T20:26:31.2859955Z cuda-nvvm-impl-12.6. | 7.7 MB | #######4 | 75%  2025-05-07T20:26:31.2860306Z 2025-05-07T20:26:31.2860310Z 2025-05-07T20:26:31.2860313Z 2025-05-07T20:26:31.2860317Z 2025-05-07T20:26:31.2860320Z 2025-05-07T20:26:31.2860324Z 2025-05-07T20:26:31.2860328Z 2025-05-07T20:26:31.2860331Z 2025-05-07T20:26:31.2860358Z 2025-05-07T20:26:31.2860361Z 2025-05-07T20:26:31.2860365Z 2025-05-07T20:26:31.2860368Z 2025-05-07T20:26:31.2860372Z 2025-05-07T20:26:31.2860375Z 2025-05-07T20:26:31.3249666Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%  2025-05-07T20:26:31.3249991Z 2025-05-07T20:26:31.3249994Z 2025-05-07T20:26:31.3249998Z 2025-05-07T20:26:31.3250002Z 2025-05-07T20:26:31.3250020Z 2025-05-07T20:26:31.3250024Z 2025-05-07T20:26:31.3250028Z 2025-05-07T20:26:31.3250031Z 2025-05-07T20:26:31.3250035Z 2025-05-07T20:26:31.3250038Z 2025-05-07T20:26:31.3250042Z 2025-05-07T20:26:31.3250052Z 2025-05-07T20:26:31.3250056Z 2025-05-07T20:26:31.3250059Z 2025-05-07T20:26:31.3250063Z 2025-05-07T20:26:31.3250066Z 2025-05-07T20:26:31.3250070Z 2025-05-07T20:26:31.3251398Z 2025-05-07T20:26:31.3472667Z libglib-2.84.0 | 3.8 MB | | 0%  2025-05-07T20:26:31.3473049Z 2025-05-07T20:26:31.3473070Z 2025-05-07T20:26:31.3473074Z 2025-05-07T20:26:31.3473078Z 2025-05-07T20:26:31.3473081Z 2025-05-07T20:26:31.3473085Z 2025-05-07T20:26:31.3473089Z 2025-05-07T20:26:31.3473092Z 2025-05-07T20:26:31.3473096Z 2025-05-07T20:26:31.3473099Z 2025-05-07T20:26:31.3473103Z 2025-05-07T20:26:31.3473106Z 2025-05-07T20:26:31.3473110Z 2025-05-07T20:26:31.3473113Z 2025-05-07T20:26:31.3478098Z 2025-05-07T20:26:31.3879957Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%  2025-05-07T20:26:31.3880335Z 2025-05-07T20:26:31.3880339Z 2025-05-07T20:26:31.3880343Z 2025-05-07T20:26:31.3880346Z 2025-05-07T20:26:31.3880350Z 2025-05-07T20:26:31.3880353Z 2025-05-07T20:26:31.3880357Z 2025-05-07T20:26:31.3880369Z 2025-05-07T20:26:31.3880373Z 2025-05-07T20:26:31.3880376Z 2025-05-07T20:26:31.3880380Z 2025-05-07T20:26:31.3880384Z 2025-05-07T20:26:31.3880387Z 2025-05-07T20:26:31.3880391Z 2025-05-07T20:26:31.3880394Z 2025-05-07T20:26:31.3880398Z 2025-05-07T20:26:31.3880408Z 2025-05-07T20:26:31.3880412Z 2025-05-07T20:26:31.3882964Z 2025-05-07T20:26:31.4257310Z ... (more hidden) ... 
2025-05-07T20:26:31.4257628Z 2025-05-07T20:26:31.4257632Z 2025-05-07T20:26:31.4257635Z 2025-05-07T20:26:31.4257639Z 2025-05-07T20:26:31.4257651Z 2025-05-07T20:26:31.4257655Z 2025-05-07T20:26:31.4257658Z 2025-05-07T20:26:31.4257662Z 2025-05-07T20:26:31.4257886Z 2025-05-07T20:26:31.4257891Z 2025-05-07T20:26:31.4257894Z 2025-05-07T20:26:31.4257898Z 2025-05-07T20:26:31.4257901Z 2025-05-07T20:26:31.4257905Z 2025-05-07T20:26:31.4257908Z 2025-05-07T20:26:31.4257912Z 2025-05-07T20:26:31.4257915Z 2025-05-07T20:26:31.4257918Z 2025-05-07T20:26:31.4884808Z libglib-2.84.0 | 3.8 MB | ########5 | 85%  2025-05-07T20:26:31.4885149Z 2025-05-07T20:26:31.4885153Z 2025-05-07T20:26:31.4885157Z 2025-05-07T20:26:31.4885160Z 2025-05-07T20:26:31.4885165Z 2025-05-07T20:26:31.4885429Z 2025-05-07T20:26:31.4885433Z 2025-05-07T20:26:31.4885436Z 2025-05-07T20:26:31.4885440Z 2025-05-07T20:26:31.4885443Z 2025-05-07T20:26:31.4885447Z 2025-05-07T20:26:31.4885450Z 2025-05-07T20:26:31.4885454Z 2025-05-07T20:26:31.4885457Z 2025-05-07T20:26:31.4885468Z 2025-05-07T20:26:31.4885475Z 2025-05-07T20:26:31.4885480Z 2025-05-07T20:26:31.4885485Z 2025-05-07T20:26:31.4885704Z 2025-05-07T20:26:31.5205095Z ... (more hidden) ... 2025-05-07T20:26:31.5205405Z 2025-05-07T20:26:31.5205409Z 2025-05-07T20:26:31.5205413Z 2025-05-07T20:26:31.5205417Z 2025-05-07T20:26:31.5205420Z 2025-05-07T20:26:31.5205424Z 2025-05-07T20:26:31.5205428Z 2025-05-07T20:26:31.5205431Z 2025-05-07T20:26:31.5205435Z 2025-05-07T20:26:31.5205438Z 2025-05-07T20:26:31.5205442Z 2025-05-07T20:26:31.5205445Z 2025-05-07T20:26:31.5205449Z 2025-05-07T20:26:31.5205452Z 2025-05-07T20:26:31.5205456Z 2025-05-07T20:26:31.5206850Z 2025-05-07T20:26:31.5342128Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%  2025-05-07T20:26:31.5342503Z 2025-05-07T20:26:31.5342507Z 2025-05-07T20:26:31.5342510Z 2025-05-07T20:26:31.5342514Z 2025-05-07T20:26:31.5342517Z 2025-05-07T20:26:31.5342521Z 2025-05-07T20:26:31.5342524Z 2025-05-07T20:26:31.5342528Z 2025-05-07T20:26:31.5342539Z 2025-05-07T20:26:31.5342543Z 2025-05-07T20:26:31.5342546Z 2025-05-07T20:26:31.5342560Z 2025-05-07T20:26:31.5342566Z 2025-05-07T20:26:31.5342571Z 2025-05-07T20:26:31.5342576Z 2025-05-07T20:26:31.5342579Z 2025-05-07T20:26:31.5342582Z 2025-05-07T20:26:31.5636116Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%  2025-05-07T20:26:31.5636698Z 2025-05-07T20:26:31.5636704Z 2025-05-07T20:26:31.5636710Z 2025-05-07T20:26:31.5636716Z 2025-05-07T20:26:31.5636722Z 2025-05-07T20:26:31.5636728Z 2025-05-07T20:26:31.5636734Z 2025-05-07T20:26:31.5636740Z 2025-05-07T20:26:31.5636746Z 2025-05-07T20:26:31.5636770Z 2025-05-07T20:26:31.5636776Z 2025-05-07T20:26:31.5636782Z 2025-05-07T20:26:31.5636787Z 2025-05-07T20:26:31.5636793Z 2025-05-07T20:26:31.5636799Z 2025-05-07T20:26:31.5636804Z 2025-05-07T20:26:31.5636810Z 2025-05-07T20:26:31.5636816Z 2025-05-07T20:26:31.6042546Z libglib-2.84.0 | 3.8 MB | ########## | 100%  2025-05-07T20:26:31.6042889Z 2025-05-07T20:26:31.6042914Z 2025-05-07T20:26:31.6042920Z 2025-05-07T20:26:31.6042935Z 2025-05-07T20:26:31.6042940Z 2025-05-07T20:26:31.6042945Z 2025-05-07T20:26:31.6042951Z 2025-05-07T20:26:31.6042956Z 2025-05-07T20:26:31.6042960Z 2025-05-07T20:26:31.6042965Z 2025-05-07T20:26:31.6042970Z 2025-05-07T20:26:31.6042975Z 2025-05-07T20:26:31.6042980Z 2025-05-07T20:26:31.6042985Z 2025-05-07T20:26:31.6042990Z 2025-05-07T20:26:31.6042995Z 2025-05-07T20:26:31.6043001Z 2025-05-07T20:26:31.6043006Z 2025-05-07T20:26:31.6047246Z 2025-05-07T20:26:32.5933111Z ... (more hidden) ... 
2025-05-07T20:26:33.7368004Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:26:34.0716038Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:26:34.3072634Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:26:34.5741789Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:26:34.6238356Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:34.6836924Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:26:34.8474057Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:26:35.0581122Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:26:35.0655418Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:26:35.2402937Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:26:35.2964490Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:26:35.4426468Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:26:35.4525876Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:26:35.6053672Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:26:35.6509515Z ... (more hidden) ...
2025-05-07T20:26:37.0222939Z libglib-2.84.0 | 3.8 MB | ########## | 100%
2025-05-07T20:26:41.7301029Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:26:41.7309284Z nsight-compute-2024. | 443.1 MB | ########## | 100%
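(Aside: the package set above is what a pinned CUDA toolkit pull brings into the conda env. A minimal sketch of reproducing the install outside this workflow, assuming the nvidia channel label; the actual install command and channel are not shown in this log:)

# hypothetical equivalent of the install step (channel/label assumed, not from this log)
conda install -n build_binary -y -c "nvidia/label/cuda-12.6.3" cuda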
2025-05-07T20:26:41.9407674Z Preparing transaction: done
2025-05-07T20:26:43.1433549Z Verifying transaction: done
2025-05-07T20:26:43.6504492Z Executing transaction: done
2025-05-07T20:26:45.8563443Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:45.8564247Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:45.8565673Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:45.8580211Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
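(Aside: recent CUDA conda packages appear to ship only the versioned libnvToolsExt.so.1, while parts of the build still link against the unversioned name, hence the two symlinks above. A minimal verification sketch, assuming the same env prefix as in the log:)

# confirm both symlinks resolve to the versioned library (paths taken from the log)
for d in lib targets/x86_64-linux/lib; do
  readlink -f "/home/ec2-user/miniconda/envs/build_binary/${d}/libnvToolsExt.so"
done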
2025-05-07T20:26:45.8593216Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:45.8598810Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:45.8808882Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:45.8831994Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:45.9212011Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:47.8150202Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:47.8792606Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:48.3112744Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:48.3467051Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
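(Aside: the `printenv LD_LIBRARY_PATH` ERROR above is most likely the expected miss, the variable had not been set yet, so the script falls through to setting it. `conda env config vars set` records the value in the environment itself and applies it on the next activation. A minimal sketch to confirm, using the same env name as the log:)

# list the vars recorded for the env, then read one back through a fresh activation
conda env config vars list -n build_binary
conda run -n build_binary printenv NVML_LIB_PATH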
2025-05-07T20:26:48.7831691Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:48.7832891Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:51.2615225Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:53.3195405Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:55.3649056Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:55.3650317Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:57.4121204Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:59.3206141Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:59.3895327Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:27:03.2726378Z /tmp/tmpkqgbq979: line 3: clang: command not found
2025-05-07T20:27:03.2727725Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:27:03.3413214Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:27:03.3438063Z total 36
2025-05-07T20:27:03.3438463Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:27:03.3439005Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:25 ..
2025-05-07T20:27:03.3439573Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:27:03.3440210Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:27:03.3440920Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:27:03.3441586Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:27:03.3442221Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:27:03.3442698Z -rw-r--r--. 2 ec2-user ec2-user  2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:27:03.3443223Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:27:03.3443871Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:27:03.3464351Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:27:05.3228593Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:27:05.3229342Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:27:05.7521299Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:27:07.6502595Z -allow-unsupported-compiler
2025-05-07T20:27:07.7200898Z [INFO] Printing out all preprocessor defines in nvcc ...
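(Aside on the compiler hook fix above, before the define dump that follows: clang is not installed in the env, so the `clang --version` probe fails, and the `c++ --version | grep -i clang` check produced no match here. The `~cuda-nvcc_activate.sh` hook would otherwise pin nvcc to the conda compiler through a `-ccbin=` line; deleting that line and prepending `-allow-unsupported-compiler` lets nvcc accept whatever host compiler the build selects. A minimal sketch of the same edit on a hypothetical hook file; the actual hook contents are not shown in this log:)

# hypothetical activation hook before the fix (illustrative, not from the log)
echo 'export NVCC_PREPEND_FLAGS="${NVCC_PREPEND_FLAGS} -ccbin=${CXX}"' > demo_activate.sh
# same fix as above: drop any -ccbin= line so nvcc stops pinning the host compiler
sed -i '/-ccbin=/d' demo_activate.sh
cat demo_activate.sh  # now empty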
2025-05-07T20:27:07.7201617Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:27:07.7201967Z 2025-05-07T20:27:09.6903579Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:27:09.6904313Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:27:09.6904706Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:27:09.6905045Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:27:09.6905379Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:27:09.6905652Z #define _STL_PAIR_H 1 2025-05-07T20:27:09.6905913Z #define __cpp_attributes 200809L 2025-05-07T20:27:09.6906245Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:27:09.6906998Z #define __DELETE_THROW throw() 2025-05-07T20:27:09.6907286Z #define _PTRDIFF_T_ 2025-05-07T20:27:09.6907523Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:27:09.6907819Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:27:09.6908095Z #define _IO_LEFT 02 2025-05-07T20:27:09.6908322Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:27:09.6908603Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:27:09.6909046Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:27:09.6909656Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:27:09.6910179Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:27:09.6910674Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:27:09.6910949Z #define _IOS_OUTPUT 2 2025-05-07T20:27:09.6911265Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:27:09.6911772Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:27:09.6912253Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:27:09.6912690Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:27:09.6913129Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:27:09.6914386Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:27:09.6927869Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:27:09.6928339Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:27:09.6928768Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:27:09.6929220Z #define _T_WCHAR_ 2025-05-07T20:27:09.6929559Z #define stdout stdout 2025-05-07T20:27:09.6930065Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:27:09.6930639Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:27:09.6931019Z #define __flexarr [] 2025-05-07T20:27:09.6931370Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:27:09.6931839Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:27:09.6932363Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:27:09.6932740Z #define _MATH_H 1 2025-05-07T20:27:09.6933139Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:27:09.6933626Z #define __S64_TYPE long int 2025-05-07T20:27:09.6933984Z #define __stub_fchflags 2025-05-07T20:27:09.6934276Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:27:09.6934687Z #define __SQUAD_TYPE long int 2025-05-07T20:27:09.6934961Z #define __INTMAX_C(c) c ## L 2025-05-07T20:27:09.6935234Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:27:09.6935493Z #define NL_NMAX INT_MAX 2025-05-07T20:27:09.6935737Z #define _BITS_TIME_H 1 2025-05-07T20:27:09.6936032Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:09.6936362Z #define 
2025-05-07T20:27:09.6940186Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2
2025-05-07T20:27:09.6948413Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min))
2025-05-07T20:27:09.7007180Z #define __CUDA_API_VER_MAJOR__ 12
2025-05-07T20:27:09.7015014Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10))
2025-05-07T20:27:09.7045043Z #define __GLIBC__ 2
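Two of the entries above are worth decoding. __CUDART_API_VERSION packs the CUDA runtime version as major * 1000 + minor * 10, so 12.6 encodes to 12060, and __GLIBC_PREREQ packs major/minor into (major << 16) + minor before comparing. A minimal sketch of the same arithmetic (version_checks.cpp is a hypothetical name; a glibc system is assumed):

    // version_checks.cpp (hypothetical) -- sketch of the version encodings used by
    // the __CUDART_API_VERSION and __GLIBC_PREREQ macros shown in the dump.
    #include <cstdio>
    #include <features.h>   // __GLIBC__, __GLIBC_MINOR__, __GLIBC_PREREQ (glibc only)

    int main() {
        // CUDA 12.6 encodes as 12 * 1000 + 6 * 10 = 12060.
        const int cudart_api = 12 * 1000 + 6 * 10;
        std::printf("CUDART API version: %d\n", cudart_api);
    #if __GLIBC_PREREQ(2, 17)
        // True when (__GLIBC__ << 16) + __GLIBC_MINOR__ >= (2 << 16) + 17.
        std::printf("glibc %d.%d >= 2.17\n", __GLIBC__, __GLIBC_MINOR__);
    #endif
        return 0;
    }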
2025-05-07T20:27:09.7133754Z #define __CUDACC_VER_MINOR__ 6
2025-05-07T20:27:09.7150047Z #define htobe32(x) __bswap_32 (x)
2025-05-07T20:27:09.7153340Z #define __GNUC__ 11
2025-05-07T20:27:09.7186983Z #define be32toh(x) __bswap_32 (x)
2025-05-07T20:27:09.7193676Z #define __cplusplus 201703L
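htobe32 and be32toh above both expand to __bswap_32 because the runner is little-endian x86_64; converting to big-endian and converting back are the same byte swap. A self-contained sketch (endian_roundtrip.cpp, a hypothetical name) of round-tripping a word through the <endian.h> macros:

    // endian_roundtrip.cpp (hypothetical) -- sketch using the <endian.h> macros
    // from the dump; on this little-endian box both calls are a byte swap.
    #include <cstdio>
    #include <cstdint>
    #include <endian.h>   // htobe32, be32toh (glibc)

    int main() {
        const uint32_t host = 0x11223344u;
        const uint32_t wire = htobe32(host);   // reads back as 0x44332211 on x86_64
        std::printf("host 0x%08x -> wire 0x%08x -> back 0x%08x\n",
                    host, wire, be32toh(wire));
        return 0;
    }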
2025-05-07T20:27:09.7213590Z #define __GLIBCXX__ 20230528
2025-05-07T20:27:09.7237813Z #define __GXX_ABI_VERSION 1016
2025-05-07T20:27:09.7245933Z #define __x86_64 1
2025-05-07T20:27:09.7260153Z #define __LP64__ 1
2025-05-07T20:27:09.7272583Z #define __CUDACC_VER_BUILD__ 85
2025-05-07T20:27:09.7291748Z #define __VERSION__ "11.4.0"
2025-05-07T20:27:09.7301132Z #define _GLIBCXX_USE_CXX11_ABI 1
2025-05-07T20:27:09.7318642Z #define __SM_80_RT_HPP__ 2025-05-07T20:27:09.7318900Z #define __need_clockid_t 2025-05-07T20:27:09.7319223Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:27:09.7319680Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:27:09.7320009Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:27:09.7320345Z #define _IO_HEX 0100 2025-05-07T20:27:09.7320603Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:27:09.7320954Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:27:09.7321272Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:27:09.7321551Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:27:09.7321964Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:09.7322440Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:27:09.7322801Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:27:09.7323096Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:27:09.7323215Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:27:09.7323322Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:27:09.7323417Z #define __stub_sstk 2025-05-07T20:27:09.7323516Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:27:09.7323672Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:27:09.7323758Z #define __wur 2025-05-07T20:27:09.7323886Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:27:09.7323973Z #define _G_HAVE_MMAP 1 2025-05-07T20:27:09.7324064Z #define _IO_OCT 040 2025-05-07T20:27:09.7324305Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:27:09.7324401Z #define NL_MSGMAX INT_MAX 2025-05-07T20:27:09.7324503Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:27:09.7324632Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:27:09.7324725Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:27:09.7324837Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:27:09.7325028Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:27:09.7325124Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:27:09.7325222Z #define _STL_ALGOBASE_H 1 2025-05-07T20:27:09.7325674Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:27:09.7325806Z #define __off64_t_defined 2025-05-07T20:27:09.7325916Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:27:09.7326006Z #define __FLT128_DIG__ 33 2025-05-07T20:27:09.7326119Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:27:09.7326221Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:27:09.7326307Z #define __INT32_C(c) c 2025-05-07T20:27:09.7326418Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:27:09.7326521Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:27:09.7326619Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:27:09.7326719Z #define __PDP_ENDIAN 3412 2025-05-07T20:27:09.7326809Z #define _ISOC95_SOURCE 1 2025-05-07T20:27:09.7326908Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:27:09.7327047Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:27:09.7327146Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:27:09.7327236Z #define __SM_90_RT_HPP__ 2025-05-07T20:27:09.7327341Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:27:09.7327444Z #define __have_pthread_attr_t 1 2025-05-07T20:27:09.7327554Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:27:09.7327778Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:27:09.7327889Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:27:09.7327997Z #define __cudaCDP2EventRecord 2025-05-07T20:27:09.7328091Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:27:09.7328183Z #define 
htole32(x) (x) 2025-05-07T20:27:09.7328447Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:27:09.7328570Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:27:09.7328671Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:27:09.7328838Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:27:09.7328978Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:27:09.7329112Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:27:09.7329253Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:27:09.7329352Z #define ADJ_OFFSET 0x0001 2025-05-07T20:27:09.7329460Z #define cudaArrayLayered 0x01 2025-05-07T20:27:09.7329627Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:27:09.7329741Z #define cudaEventRecordDefault 0x00 2025-05-07T20:27:09.7329844Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:27:09.7329944Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:27:09.7330033Z #define unix 1 2025-05-07T20:27:09.7330134Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:27:09.7330227Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:27:09.7330320Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:27:09.7330447Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:27:09.7330534Z #define __USE_POSIX 1 2025-05-07T20:27:09.7330636Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:27:09.7330768Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:27:09.7330862Z #define __THROWNL throw () 2025-05-07T20:27:09.7330963Z #define __cpp_rtti 199711L 2025-05-07T20:27:09.7331073Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:27:09.7331162Z #define __PMT(args) args 2025-05-07T20:27:09.7331285Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.7331436Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:27:09.7331551Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:27:09.7331648Z #define _SIZE_T_DECLARED 2025-05-07T20:27:09.7331984Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:27:09.7332088Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:27:09.7332485Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:27:09.7332587Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:27:09.7332689Z #define XATTR_LIST_MAX 65536 2025-05-07T20:27:09.7332786Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:27:09.7332929Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:27:09.7333018Z #define _WCHAR_T_H 2025-05-07T20:27:09.7333110Z #define __FLT64X_DIG__ 18 2025-05-07T20:27:09.7333328Z #define _IO_SHOWBASE 0200 2025-05-07T20:27:09.7333421Z #define _POSIX_QLIMIT 1 2025-05-07T20:27:09.7333521Z #define __INT8_TYPE__ signed char 2025-05-07T20:27:09.7333618Z #define __SURFACE_TYPES_H__ 2025-05-07T20:27:09.7333712Z #define __CUDA_ARCH__ 520 2025-05-07T20:27:09.7333823Z #define __cpp_digit_separators 201309L 2025-05-07T20:27:09.7333913Z #define __ELF__ 1 2025-05-07T20:27:09.7334019Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:27:09.7334123Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:27:09.7334215Z #define STA_INS 0x0010 2025-05-07T20:27:09.7334316Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:27:09.7334487Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:27:09.7334692Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:27:09.7334797Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:27:09.7334908Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:27:09.7335024Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7335127Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:27:09.7335237Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:27:09.7335335Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:27:09.7335490Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:27:09.7335657Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:27:09.7335757Z #define _IO_funlockfile(_fp) 2025-05-07T20:27:09.7336092Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:09.7336231Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:27:09.7336323Z #define __DRIVER_TYPES_H__ 2025-05-07T20:27:09.7336410Z #define __FLT_RADIX__ 2 2025-05-07T20:27:09.7336521Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:27:09.7336690Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:27:09.7336795Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:27:09.7336889Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:27:09.7337000Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:27:09.7337105Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:27:09.7337203Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:27:09.7337305Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:27:09.7337401Z #define WORD_BIT 32 2025-05-07T20:27:09.7337489Z #define _IO_USER_BUF 1 2025-05-07T20:27:09.7337586Z #define __VECTOR_TYPES_H__ 2025-05-07T20:27:09.7337706Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7337814Z #define cudaHostAllocPortable 0x01 2025-05-07T20:27:09.7337913Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:27:09.7338028Z #define __long_double_t long double 2025-05-07T20:27:09.7338124Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:27:09.7338223Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:27:09.7338625Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:27:09.7338709Z #define __k8 1 2025-05-07T20:27:09.7338913Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:27:09.7339094Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:27:09.7339211Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:27:09.7339322Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:27:09.7339421Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:27:09.7339522Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:27:09.7339828Z #define __blksize_t_defined 2025-05-07T20:27:09.7339932Z #define _IO_SHOWPOINT 0400 2025-05-07T20:27:09.7340034Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:27:09.7340148Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:27:09.7340243Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:27:09.7340358Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:27:09.7340454Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:27:09.7340550Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:27:09.7340819Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:27:09.7341248Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:27:09.7341354Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:27:09.7341459Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:27:09.7341547Z #define SEEK_SET 0 2025-05-07T20:27:09.7341653Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:27:09.7341754Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:27:09.7341957Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:27:09.7342070Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:27:09.7342175Z #define __cudaCDP2GetLastError 2025-05-07T20:27:09.7342271Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:27:09.7342374Z #define _MATH_H_MATHDEF 1 2025-05-07T20:27:09.7342700Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:27:09.7342805Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:27:09.7342910Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:27:09.7343009Z #define __stub_sigreturn 2025-05-07T20:27:09.7343266Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:27:09.7343364Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:27:09.7343459Z #define __HOST_CONFIG_H__ 2025-05-07T20:27:09.7343573Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:27:09.7343662Z #define CLOCK_TAI 11 2025-05-07T20:27:09.7343779Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:27:09.7343879Z #define __restrict_arr 2025-05-07T20:27:09.7343999Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:27:09.7344145Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:27:09.7344698Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:27:09.7344891Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:27:09.7345023Z #define __USE_MISC 1 2025-05-07T20:27:09.7345163Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:27:09.7345296Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:27:09.7345395Z #define _GCC_LIMITS_H_ 2025-05-07T20:27:09.7345624Z #define __LDBL_DIG__ 18 2025-05-07T20:27:09.7345722Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:27:09.7345829Z #define __malloc_and_calloc_defined 2025-05-07T20:27:09.7345929Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:27:09.7346029Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:27:09.7346115Z #define __x86_64__ 1 2025-05-07T20:27:09.7346196Z #define _SIZE_T_ 2025-05-07T20:27:09.7347091Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:27:09.7347199Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:27:09.7347294Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:27:09.7347415Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:27:09.7347533Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:27:09.7347631Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:27:09.7347745Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:27:09.7347966Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:27:09.7348121Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:27:09.7348255Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:27:09.7348889Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
2025-05-07T20:27:09.7349024Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:27:09.7349170Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:27:09.7349371Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:27:09.7349476Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:27:09.7349566Z #define STA_FLL 0x0008 2025-05-07T20:27:09.7349718Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:27:09.7349819Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:27:09.7349948Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7350074Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:27:09.7350165Z #define __stub_revoke 2025-05-07T20:27:09.7350260Z #define __timer_t_defined 1 2025-05-07T20:27:09.7350412Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:27:09.7350509Z #define INT_MAX __INT_MAX__ 2025-05-07T20:27:09.7350620Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:27:09.7350743Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:27:09.7350845Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:27:09.7350951Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:27:09.7351074Z #define cudaArrayTextureGather 0x08 2025-05-07T20:27:09.7351183Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:27:09.7351344Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:27:09.7351446Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:27:09.7351540Z #define _IO_off_t __off_t 2025-05-07T20:27:09.7351637Z #define __FLT64_DIG__ 15 2025-05-07T20:27:09.7351869Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:27:09.7351970Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:27:09.7352111Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.7352237Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:27:09.7352337Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:27:09.7352453Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:27:09.7352542Z #define NULL __null 2025-05-07T20:27:09.7352684Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:27:09.7352793Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:27:09.7352898Z #define __U64_TYPE unsigned long int 2025-05-07T20:27:09.7353010Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:27:09.7353109Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:27:09.7353196Z #define FP_ZERO 2 2025-05-07T20:27:09.7353302Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:27:09.7353458Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:27:09.7353571Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7353671Z #define __WCHAR_T__ 2025-05-07T20:27:09.7353772Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:27:09.7353968Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:27:09.7354127Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:27:09.7354224Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:27:09.7354350Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:27:09.7354465Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:09.7354593Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:27:09.7354724Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:27:09.7354821Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:27:09.7354910Z #define _SIGSET_H_types 1 2025-05-07T20:27:09.7355029Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:27:09.7355133Z #define __cpp_unicode_literals 200710L 2025-05-07T20:27:09.7355280Z #define __isdigit_l(c,l) 
__isctype_l((c), _ISdigit, (l)) 2025-05-07T20:27:09.7355390Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:27:09.7355632Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:27:09.7355784Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:27:09.7355991Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:27:09.7356191Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:27:09.7356401Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:27:09.7368166Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:27:09.7368312Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:27:09.7368411Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:27:09.7368643Z #define STA_MODE 0x4000 2025-05-07T20:27:09.7368757Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:27:09.7368862Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:27:09.7368994Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:27:09.7369100Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:27:09.7369199Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:27:09.7369320Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:27:09.7369419Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:27:09.7369535Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:27:09.7369634Z #define __SIZE_WIDTH__ 64 2025-05-07T20:27:09.7369749Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:09.7369838Z #define __SEG_FS 1 2025-05-07T20:27:09.7369924Z #define _IO_size_t size_t 2025-05-07T20:27:09.7370033Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:27:09.7370135Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:27:09.7370225Z #define __stub_lchmod 2025-05-07T20:27:09.7370335Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:27:09.7370446Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7370554Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:27:09.7370638Z #define __SEG_GS 1 2025-05-07T20:27:09.7370823Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:27:09.7370925Z #define _IOS_APPEND 8 2025-05-07T20:27:09.7371022Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:27:09.7371120Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:27:09.7371228Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:27:09.7371328Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:27:09.7371430Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:27:09.7371529Z #define htole16(x) (x) 2025-05-07T20:27:09.7371642Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:09.7371738Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:27:09.7371842Z #define __INT16_TYPE__ short int 2025-05-07T20:27:09.7371946Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:27:09.7372068Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:27:09.7372184Z #define __cpp_structured_bindings 201606L 2025-05-07T20:27:09.7372309Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:27:09.7372410Z #define __SIZEOF_INT__ 4 2025-05-07T20:27:09.7372503Z #define __WCLONE 0x80000000 2025-05-07T20:27:09.7372598Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:27:09.7372693Z #define SEEK_HOLE 4 2025-05-07T20:27:09.7372787Z #define TIMER_ABSTIME 1 2025-05-07T20:27:09.7372883Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:27:09.7372982Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:27:09.7373158Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:09.7373275Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7373382Z #define __DRIVER_FUNCTIONS_H__ 
2025-05-07T20:27:09.7373494Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:27:09.7373602Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7373726Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:27:09.7373824Z #define _LINUX_LIMITS_H 2025-05-07T20:27:09.7373915Z #define linux 1 2025-05-07T20:27:09.7374008Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:27:09.7374120Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:27:09.7374224Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:27:09.7374319Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:27:09.7374435Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:27:09.7374856Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:27:09.7374964Z #define __cpp_lib_hypot 201603 2025-05-07T20:27:09.7375067Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:27:09.7375164Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:27:09.7375254Z #define MOD_NANO ADJ_NANO 2025-05-07T20:27:09.7375347Z #define htole64(x) (x) 2025-05-07T20:27:09.7375449Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:27:09.7375582Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:27:09.7375678Z #define _IO_UPPERCASE 01000 2025-05-07T20:27:09.7376259Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:27:09.7376358Z #define __USE_POSIX2 1 2025-05-07T20:27:09.7376460Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:27:09.7376551Z #define __WALL 0x40000000 2025-05-07T20:27:09.7376657Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:27:09.7376745Z #define _XLOCALE_H 1 2025-05-07T20:27:09.7376854Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:27:09.7376973Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:27:09.7377102Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:27:09.7377247Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:27:09.7377371Z #define __EXCEPTIONS 1 2025-05-07T20:27:09.7377504Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:27:09.7377707Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:27:09.7377794Z #define __WORDSIZE 64 2025-05-07T20:27:09.7377888Z #define CLOCK_MONOTONIC 1 2025-05-07T20:27:09.7377985Z #define _STL_RELOPS_H 1 2025-05-07T20:27:09.7378086Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:27:09.7378184Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:27:09.7378289Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:27:09.7378385Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:27:09.7378487Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:27:09.7378798Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:27:09.7379039Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:09.7379187Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:27:09.7379288Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:27:09.7379396Z #define __cpp_range_based_for 201603L 2025-05-07T20:27:09.7379516Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:27:09.7379618Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:27:09.7379728Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:27:09.7379920Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:27:09.7380024Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:27:09.7380124Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:27:09.7380236Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:27:09.7380413Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:09.7380536Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:27:09.7380623Z #define _STRING_H 1 2025-05-07T20:27:09.7380731Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:27:09.7380830Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:27:09.7380930Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:27:09.7381067Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:27:09.7381173Z #define __code_model_small__ 1 2025-05-07T20:27:09.7381264Z #define _PSTL_CONFIG_H 2025-05-07T20:27:09.7381371Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:27:09.7381495Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:27:09.7381591Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:27:09.7381709Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:27:09.7382055Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:09.7382151Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:27:09.7382248Z #define le64toh(x) (x) 2025-05-07T20:27:09.7382338Z #define FILENAME_MAX 4096 2025-05-07T20:27:09.7382583Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:27:09.7382708Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:27:09.7382795Z #define L_cuserid 9 2025-05-07T20:27:09.7382884Z #define __ino_t_defined 2025-05-07T20:27:09.7382977Z #define __k8__ 1 2025-05-07T20:27:09.7383079Z #define __INTPTR_TYPE__ long int 2025-05-07T20:27:09.7383190Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:27:09.7383284Z #define __int8_t_defined 2025-05-07T20:27:09.7383377Z #define __WCHAR_TYPE__ int 2025-05-07T20:27:09.7383484Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:27:09.7383602Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:27:09.7383778Z #define __SLONGWORD_TYPE long int 2025-05-07T20:27:09.7383873Z #define _IOS_TRUNC 16 2025-05-07T20:27:09.7383994Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:27:09.7384147Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:27:09.7384243Z #define __HAVE_COLUMN 2025-05-07T20:27:09.7384331Z #define __stub_fdetach 2025-05-07T20:27:09.7384751Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:27:09.7384845Z #define __pic__ 2 2025-05-07T20:27:09.7384968Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.7385076Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:27:09.7385172Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:27:09.7385276Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:27:09.7385370Z #define __stub_chflags 2025-05-07T20:27:09.7385460Z #define CLOCK_BOOTTIME 7 2025-05-07T20:27:09.7385547Z #define __need_IOV_MAX 2025-05-07T20:27:09.7385669Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:27:09.7385778Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:27:09.7385880Z #define __cpp_decltype 200707L 2025-05-07T20:27:09.7385990Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:27:09.7386083Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:27:09.7386193Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:27:09.7386294Z #define TTY_NAME_MAX 32 2025-05-07T20:27:09.7386463Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:27:09.7386597Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7386770Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:27:09.7386884Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:27:09.7386990Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:27:09.7387086Z #define STA_PPSTIME 0x0004 2025-05-07T20:27:09.7387176Z #define __import__ 2025-05-07T20:27:09.7387273Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:27:09.7387415Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:27:09.7387501Z #define __export__ 2025-05-07T20:27:09.7387629Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:27:09.7387733Z #define cudaMemAttachHost 0x02 2025-05-07T20:27:09.7387905Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:09.7388004Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:27:09.7388099Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:27:09.7388204Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:27:09.7388295Z #define _WCHAR_T_DECLARED 
2025-05-07T20:27:09.7388416Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:27:09.7388542Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:27:09.7388649Z #define __cpp_inline_variables 201606L 2025-05-07T20:27:09.7388740Z #define WNOWAIT 0x01000000 2025-05-07T20:27:09.7388831Z #define PLOSS 6 2025-05-07T20:27:09.7388926Z #define M_LN10 2.30258509299404568402 2025-05-07T20:27:09.7389191Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:27:09.7389299Z #define EXIT_SUCCESS 0 2025-05-07T20:27:09.7389400Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:27:09.7389503Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:27:09.7389605Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:27:09.7389698Z #define __thread__ __thread 2025-05-07T20:27:09.7389802Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:27:09.7389987Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:27:09.7390093Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:27:09.7390329Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:09.7390446Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:27:09.7390543Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:27:09.7390635Z #define __linux__ 1 2025-05-07T20:27:09.7390733Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:27:09.7390869Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:27:09.7390964Z #define __S16_TYPE short int 2025-05-07T20:27:09.7391427Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:27:09.7391540Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:27:09.7391734Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:27:09.7391834Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:27:09.7391949Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:27:09.7392035Z #define _T_SIZE_ 2025-05-07T20:27:09.7392136Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:09.7392268Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:27:09.7392369Z #define _PSTL_VERSION 12000 2025-05-07T20:27:09.7392521Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:27:09.7392640Z #define __WNOTHREAD 0x20000000 2025-05-07T20:27:09.7392744Z #define _G_va_list __gnuc_va_list 2025-05-07T20:27:09.7392880Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:27:09.7392967Z #define _IOS_INPUT 1 2025-05-07T20:27:09.7393069Z #define __USE_LARGEFILE64 1 2025-05-07T20:27:09.7393184Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:27:09.7393278Z #define __INT64_TYPE__ long int 2025-05-07T20:27:09.7393377Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:27:09.7393485Z #define __shared__ __location__(shared) 2025-05-07T20:27:09.7393580Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:27:09.7393743Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:27:09.7393841Z #define __gid_t_defined 2025-05-07T20:27:09.7393957Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:27:09.7394066Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:27:09.7394267Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:27:09.7394368Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:27:09.7394467Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:27:09.7394556Z #define ___int_size_t_h 2025-05-07T20:27:09.7394666Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:09.7394805Z #define __cpp_inheriting_constructors 
201511L 2025-05-07T20:27:09.7394964Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:27:09.7395070Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:27:09.7395175Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:27:09.7395275Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:27:09.7395381Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:27:09.7395512Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7395629Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:27:09.7395758Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:27:09.7395852Z #define __clock_t_defined 1 2025-05-07T20:27:09.7395953Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:27:09.7396075Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:27:09.7396166Z #define __GLIBC_MINOR__ 17 2025-05-07T20:27:09.7396262Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:27:09.7396371Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:27:09.7396484Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:27:09.7396585Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:27:09.7396767Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:09.7396852Z #define __SSE__ 1 2025-05-07T20:27:09.7396958Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:27:09.7397058Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:27:09.7397143Z #define _CTYPE_H 1 2025-05-07T20:27:09.7397336Z #define __sigset_t_defined 2025-05-07T20:27:09.7397437Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:27:09.7397534Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:27:09.7397627Z #define MOD_TAI ADJ_TAI 2025-05-07T20:27:09.7397730Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:27:09.7397826Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:27:09.7397917Z #define __SM_70_RT_H__ 2025-05-07T20:27:09.7398014Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:27:09.7398131Z #define cudaEventWaitDefault 0x00 2025-05-07T20:27:09.7398225Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:27:09.7398469Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:09.7398572Z #define _POSIX_MAX_CANON 255 2025-05-07T20:27:09.7398683Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:27:09.7398778Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:27:09.7398879Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:27:09.7398962Z #define __amd64__ 1 2025-05-07T20:27:09.7399052Z #define __WINT_WIDTH__ 32 2025-05-07T20:27:09.7399168Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:27:09.7399443Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:09.7399547Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:27:09.7399638Z #define EOF (-1) 2025-05-07T20:27:09.7399732Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:27:09.7399829Z #define __USE_POSIX199309 1 2025-05-07T20:27:09.7399920Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:27:09.7400012Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:27:09.7400109Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:27:09.7400211Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:27:09.7400321Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:27:09.7400417Z #define ____mbstate_t_defined 1 2025-05-07T20:27:09.7400502Z #define STA_NANO 0x2000 2025-05-07T20:27:09.7400592Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:27:09.7400692Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:27:09.7400775Z #define _IO_LINKED 0x80 2025-05-07T20:27:09.7400877Z #define __cpp_lib_launder 201606 2025-05-07T20:27:09.7400973Z #define 
__SIZEOF_INT128__ 16 2025-05-07T20:27:09.7401073Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:09.7401175Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:09.7401268Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:09.7401407Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:09.7401523Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7401623Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:09.7401721Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:09.7401824Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:09.7401910Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:09.7402039Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:09.7402167Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:09.7402369Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:09.7402564Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:09.7402646Z #define __stub_stty 2025-05-07T20:27:09.7402813Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:09.7402904Z #define le16toh(x) (x) 2025-05-07T20:27:09.7403009Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:09.7403184Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:09.7403272Z #define _SIZET_ 2025-05-07T20:27:09.7403362Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:09.7403447Z #define _SVID_SOURCE 1 2025-05-07T20:27:09.7403532Z #define _LP64 1 2025-05-07T20:27:09.7403623Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:09.7403859Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:09.7403976Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:09.7404061Z #define __UINT8_C(c) c 2025-05-07T20:27:09.7404161Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:09.7404252Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:09.7404451Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:09.7404552Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:09.7404643Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:09.7404740Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:09.7404828Z #define CUDARTAPI 2025-05-07T20:27:09.7404909Z #define IOV_MAX 1024 2025-05-07T20:27:09.7405070Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:09.7405201Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:09.7405751Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:09.7405870Z #define __wchar_t__ 2025-05-07T20:27:09.7406099Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:09.7406183Z #define SEEK_END 2 2025-05-07T20:27:09.7406282Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:09.7406457Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:09.7406556Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:09.7406707Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:09.7406801Z #define ____FILE_defined 1 2025-05-07T20:27:09.7406918Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:09.7407019Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:09.7407105Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:09.7407198Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:09.7407468Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:09.7407600Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:09.7407692Z #define _IO_RIGHT 04 2025-05-07T20:27:09.7407783Z #define __END_NAMESPACE_STD 2025-05-07T20:27:09.7407971Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:09.7408074Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:09.7408192Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:09.7408287Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:09.7408397Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:09.7408476Z #define _STDDEF_H_ 2025-05-07T20:27:09.7408654Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:09.7408757Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:09.7408874Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:09.7409080Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:09.7409191Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7409334Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:09.7409461Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:09.7409562Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:09.7409675Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:09.7409777Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:09.7409887Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:09.7409983Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:09.7410080Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:09.7410174Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:09.7410356Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:09.7410445Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:09.7410625Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:09.7410726Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:09.7410817Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:09.7410960Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:09.7411058Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:09.7411149Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:09.7411247Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:09.7411352Z #define P_tmpdir "/tmp" 2025-05-07T20:27:09.7411470Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:09.7411562Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:09.7411666Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:09.7411830Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:09.7412006Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:09.7412196Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:09.7412319Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:09.7412438Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:09.7412539Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:09.7412770Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:09.7412874Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:09.7412985Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:09.7413077Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:09.7413255Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:09.7413347Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:09.7413449Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:09.7413543Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:09.7413623Z #define __FXSR__ 1 2025-05-07T20:27:09.7413707Z #define _SIZE_T 2025-05-07T20:27:09.7413807Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:09.7413924Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:09.7414099Z #define 
__FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:09.7414248Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:09.7414339Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:09.7414444Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:09.7414741Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:09.7414952Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:09.7415042Z #define _GXX_NULLPTR_T 2025-05-07T20:27:09.7415171Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:09.7415266Z #define FOPEN_MAX 16 2025-05-07T20:27:09.7415355Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:09.7415473Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:09.7415577Z #define __suseconds_t_defined 2025-05-07T20:27:09.7415668Z #define __off_t_defined 2025-05-07T20:27:09.7415751Z #define stderr stderr 2025-05-07T20:27:09.7415858Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:09.7415970Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:09.7416066Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:09.7417939Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:09.7418348Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:09.7418443Z #define __mode_t_defined 2025-05-07T20:27:09.7418524Z #define _GCC_SIZE_T 2025-05-07T20:27:09.7418621Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:09.7418732Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:09.7418837Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:09.7418929Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:09.7419027Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:09.7419132Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:09.7419236Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:09.7419349Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:09.7419438Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:09.7419528Z #define __size_t__ 2025-05-07T20:27:09.7419659Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:09.7419754Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:09.7419870Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:09.7420020Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:09.7420111Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:09.7420300Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:09.7420385Z #define _ENDIAN_H 1 2025-05-07T20:27:09.7420504Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:09.7420599Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:09.7420700Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:09.7420788Z #define __try try 2025-05-07T20:27:09.7420888Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:09.7420983Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:09.7421077Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:09.7421466Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:09.7421556Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:09.7421642Z #define __PIC__ 2 2025-05-07T20:27:09.7421758Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:09.7421877Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:09.7422014Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:09.7422111Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:09.7422210Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:09.7422480Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:09.7422606Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7422721Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:09.7422824Z #define _IO_uid_t __uid_t 2025-05-07T20:27:09.7422921Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:09.7423055Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:09.7423152Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:09.7423298Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:09.7423407Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:09.7423535Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:09.7423627Z #define LONG_BIT 64 2025-05-07T20:27:09.7423735Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:09.7423835Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:09.7423971Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:09.7424065Z #define __fsfilcnt_t_defined 2025-05-07T20:27:09.7424160Z #define __blkcnt_t_defined 2025-05-07T20:27:09.7424442Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:09.7424533Z #define __USE_LARGEFILE 1 2025-05-07T20:27:09.7424630Z #define __cpp_constexpr 201603L 2025-05-07T20:27:09.7424731Z #define CUDART_VERSION 12060 2025-05-07T20:27:09.7424819Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:09.7424924Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:09.7425017Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:09.7425215Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:09.7425313Z #define __lldiv_t_defined 1 2025-05-07T20:27:09.7425641Z #define __SSE2__ 1 2025-05-07T20:27:09.7425781Z #define _IOLBF 1 2025-05-07T20:27:09.7425903Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:09.7425999Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:09.7426105Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:09.7426205Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:09.7426323Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:09.7426411Z #define __INT32_TYPE__ int 2025-05-07T20:27:09.7426506Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:09.7426612Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:09.7426712Z #define __cpp_exceptions 199711L 2025-05-07T20:27:09.7426820Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:09.7426931Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:09.7427035Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:09.7427151Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:09.7427315Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:09.7427421Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:09.7427519Z #define __SWORD_TYPE long int 2025-05-07T20:27:09.7427613Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:09.7427719Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:09.7427813Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:09.7427904Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:09.7428200Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:09.7428295Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:09.7428453Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:09.7428535Z #define _T_SIZE 2025-05-07T20:27:09.7428640Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:09.7428773Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:09.7429183Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:27:09.7429282Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:27:09.7429382Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:27:09.7429503Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:27:09.7429602Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:27:09.7429706Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:09.7429795Z #define __ATOMIC_CONSUME 1 2025-05-07T20:27:09.7429979Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:27:09.7430067Z #define __GNUC_MINOR__ 4 2025-05-07T20:27:09.7430307Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:27:09.7430408Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:27:09.7430525Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.7430611Z #define __PIE__ 2 2025-05-07T20:27:09.7430720Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:27:09.7430821Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:27:09.7431019Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:27:09.7431252Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:09.7431343Z #define __nlink_t_defined 2025-05-07T20:27:09.7431472Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:27:09.7431593Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:27:09.7431679Z #define _XOPEN_LIM_H 1 2025-05-07T20:27:09.7431955Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:09.7432075Z #define __cpp_template_template_args 201611L 2025-05-07T20:27:09.7432186Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:27:09.7432299Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:27:09.7432406Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:27:09.7432507Z #define __FILE_defined 1 2025-05-07T20:27:09.7432718Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:27:09.7432816Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:27:09.7432917Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:27:09.7433033Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:27:09.7433153Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:27:09.7433271Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:27:09.7433374Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:27:09.7433460Z #define __INT16_C(c) c 2025-05-07T20:27:09.7433565Z #define __U32_TYPE unsigned int 2025-05-07T20:27:09.7433665Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:27:09.7433789Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:27:09.7433883Z #define __STDC__ 1 2025-05-07T20:27:09.7433980Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:27:09.7434080Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:27:09.7434185Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:27:09.7434339Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:27:09.7434436Z #define __FLT32X_DIG__ 15 2025-05-07T20:27:09.7434542Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:27:09.7434639Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:27:09.7434761Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:27:09.7434875Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:27:09.7434974Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:27:09.7435087Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:27:09.7435173Z #define stdin stdin 2025-05-07T20:27:09.7435263Z #define __ino64_t_defined 
[... remainder of the preprocessor #define dump elided: several thousand macro definitions emitted by the toolchain headers ...]
2025-05-07T20:27:09.7693452Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:11.6739684Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:11.6740067Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:27:11.6740393Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:27:11.6740708Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:27:11.6741051Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:27:11.7459726Z /usr/bin/nvidia-smi
2025-05-07T20:27:11.7464889Z + nvidia-smi
2025-05-07T20:27:11.7641046Z Wed May 7 20:27:11 2025
2025-05-07T20:27:11.7641438Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:11.7641954Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:27:11.7642460Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:11.7642960Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:11.7643538Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:27:11.7643981Z | | | MIG M. |
2025-05-07T20:27:11.7644318Z |=========================================+========================+======================|
2025-05-07T20:27:11.7811386Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:27:11.7811874Z | 0% 27C P8 16W / 300W | 0MiB / 23028MiB | 0% Default |
2025-05-07T20:27:11.7812272Z | | | N/A |
2025-05-07T20:27:11.7812677Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:11.7816458Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:11.7816900Z | Processes: |
2025-05-07T20:27:11.7817354Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:27:11.7817782Z | ID ID Usage |
2025-05-07T20:27:11.7818136Z |=========================================================================================|
2025-05-07T20:27:11.7822390Z | No running processes found |
2025-05-07T20:27:11.7822972Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:12.0537937Z [INSTALL] Successfully installed CUDA 12.6.3
2025-05-07T20:27:12.0599120Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:27:12.0599706Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:12.0611722Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:27:12.0612088Z env: 2025-05-07T20:27:12.0612326Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:27:12.0612637Z BUILD_ENV: build_binary 2025-05-07T20:27:12.0612902Z BUILD_TARGET: genai 2025-05-07T20:27:12.0613182Z BUILD_VARIANT: cuda 2025-05-07T20:27:12.0613437Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:27:12.0613700Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:27:12.0614013Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:27:12.0614364Z ##[endgroup] 2025-05-07T20:27:12.4027960Z ################################################################################ 2025-05-07T20:27:12.4028510Z # Install PyTorch (PIP) 2025-05-07T20:27:12.4028804Z # 2025-05-07T20:27:12.4044457Z # [2025-05-07T20:27:12.404Z] + install_pytorch_pip build_binary nightly cuda/12.6.3 2025-05-07T20:27:12.4044976Z ################################################################################ 2025-05-07T20:27:12.4045230Z 2025-05-07T20:27:12.4074321Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:27:13.3965994Z Channels: 2025-05-07T20:27:13.3966388Z - conda-forge 2025-05-07T20:27:13.3966753Z Platform: linux-64 2025-05-07T20:27:16.8609194Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:27:17.5989106Z Solving environment: \ | / done 2025-05-07T20:27:17.8163379Z 2025-05-07T20:27:17.8163685Z ## Package Plan ## 2025-05-07T20:27:17.8163927Z 2025-05-07T20:27:17.8164209Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:27:17.8164673Z 2025-05-07T20:27:17.8164833Z added / updated specs: 2025-05-07T20:27:17.8165084Z - numpy 2025-05-07T20:27:17.8165201Z 2025-05-07T20:27:17.8165229Z 2025-05-07T20:27:17.8165352Z The following packages will be downloaded: 2025-05-07T20:27:17.8165587Z 2025-05-07T20:27:17.8165701Z package | build 2025-05-07T20:27:17.8166035Z ---------------------------|----------------- 2025-05-07T20:27:17.8166417Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:27:17.8166878Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:27:17.8167408Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:27:17.8168081Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:27:17.8168729Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:27:17.8169214Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:27:17.8169677Z numpy-2.2.5 | py312h72c5963_0 8.1 MB conda-forge 2025-05-07T20:27:17.8170071Z ------------------------------------------------------------ 2025-05-07T20:27:17.8170723Z Total: 15.4 MB 2025-05-07T20:27:17.8170945Z 2025-05-07T20:27:17.8171072Z The following NEW packages will be INSTALLED: 2025-05-07T20:27:17.8171302Z 2025-05-07T20:27:17.8171526Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:27:17.8172032Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:27:17.8172555Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:27:17.8173073Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:27:17.8173608Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:27:17.8174157Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:27:17.8175018Z numpy 
conda-forge/linux-64::numpy-2.2.5-py312h72c5963_0
2025-05-07T20:27:17.8175462Z Downloading and Extracting Packages: ...working...
[... interleaved per-package progress bars and terminal control sequences elided; all seven packages downloaded to 100% ...]
2025-05-07T20:27:18.7067540Z done
2025-05-07T20:27:18.8069844Z Preparing transaction: done
2025-05-07T20:27:19.0077997Z Verifying transaction: done
2025-05-07T20:27:19.1089411Z Executing transaction: done
2025-05-07T20:27:19.2947015Z ################################################################################
2025-05-07T20:27:19.2947433Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:19.2947747Z #
2025-05-07T20:27:19.2962787Z # [2025-05-07T20:27:19.295Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:27:19.2963296Z ################################################################################
2025-05-07T20:27:19.2978610Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:19.3873485Z [CHECK] Network does not appear to be blocked. 
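
[NOTE] The step below resolves the requested channel (nightly) and variant (cuda/12.6.3 -> cu126) into pip arguments before installing. A minimal standalone sketch of the same install, assuming a conda environment named build_binary as in this job, is:

    # Install the latest PyTorch nightly wheel built against CUDA 12.6,
    # using the same index URL and flags recorded later in this log.
    conda run -n build_binary pip install --pre torch \
        --index-url https://download.pytorch.org/whl/nightly/cu126/
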
2025-05-07T20:27:19.3873843Z ################################################################################ 2025-05-07T20:27:19.3874179Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:19.3874463Z # 2025-05-07T20:27:19.3894096Z # [2025-05-07T20:27:19.388Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:27:19.3894871Z ################################################################################ 2025-05-07T20:27:19.3895189Z 2025-05-07T20:27:19.3918297Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:19.3944069Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:27:19.3961677Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:19.3962225Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:19.3970715Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:19.3979707Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:27:19.4000658Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:29:06.3999957Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:29:06.4000578Z Collecting torch 2025-05-07T20:29:06.4001509Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:29:06.4002530Z Collecting filelock (from torch) 2025-05-07T20:29:06.4003230Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:29:06.4004575Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (4.13.2) 2025-05-07T20:29:06.4005705Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (78.1.1) 2025-05-07T20:29:06.4006405Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:29:06.4006934Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:29:06.4007795Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 42.3 MB/s eta 0:00:00 2025-05-07T20:29:06.4008153Z Collecting networkx (from torch) 2025-05-07T20:29:06.4008663Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:29:06.4009317Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 24.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4009665Z Collecting jinja2 (from torch) 2025-05-07T20:29:06.4010151Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:29:06.4010667Z Collecting fsspec (from torch) 2025-05-07T20:29:06.4011160Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:29:06.4011751Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:29:06.4012489Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:29:06.4013308Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 53.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4014220Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:29:06.4015111Z Downloading 
https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:29:06.4015939Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 10.3 MB/s eta 0:00:00 2025-05-07T20:29:06.4016437Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:29:06.4017263Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:29:06.4018128Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 41.5 MB/s eta 0:00:00 2025-05-07T20:29:06.4018605Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:29:06.4019540Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:29:06.4020375Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 37.3 MB/s eta 0:00:00 2025-05-07T20:29:06.4020774Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:29:06.4021576Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:29:06.4022465Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 44.6 MB/s eta 0:00:00 2025-05-07T20:29:06.4022859Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:29:06.4023563Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:29:06.4024349Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 141.8 MB/s eta 0:00:00 2025-05-07T20:29:06.4024744Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:29:06.4025799Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:29:06.4026614Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 209.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4027012Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:29:06.4027739Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:29:06.4028549Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 148.2 MB/s eta 0:00:00 2025-05-07T20:29:06.4028948Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:29:06.4029670Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:29:06.4030481Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 112.6 MB/s eta 0:00:00 2025-05-07T20:29:06.4030883Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:29:06.4031618Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:29:06.4032432Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 162.0 MB/s eta 0:00:00 2025-05-07T20:29:06.4032808Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:29:06.4033607Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:29:06.4034405Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:29:06.4035091Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 
2025-05-07T20:29:06.4035792Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:29:06.4036613Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:29:06.4037504Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 212.3 MB/s eta 0:00:00 2025-05-07T20:29:06.4038050Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:29:06.4038871Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:29:06.4039711Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:29:06.4040586Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:06.4041446Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:29:06.4042021Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:29:06.4042668Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 52.2 MB/s eta 0:00:00 2025-05-07T20:29:06.4043195Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:29:06.4043993Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB) 2025-05-07T20:29:06.4045091Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp312-cp312-manylinux_2_28_x86_64.whl (825.4 MB) 2025-05-07T20:29:06.4045919Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.4/825.4 MB 16.7 MB/s eta 0:00:00 2025-05-07T20:29:06.4046714Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:29:06.4047593Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 80.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4048376Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:29:06.4049254Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 149.8 MB/s eta 0:00:00 2025-05-07T20:29:06.4050080Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:29:06.4051012Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 135.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4053012Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:29:06.4054963Z 2025-05-07T20:29:06.4057064Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 
nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:29:06.4062207Z 2025-05-07T20:29:08.6435981Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:29:08.6438484Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:29:12.1665854Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:29:15.6827355Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:29:15.6827968Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:29:19.1167692Z True 2025-05-07T20:29:19.1167938Z True 2025-05-07T20:29:19.1168044Z 2025-05-07T20:29:19.1824714Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:29:19.1871749Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:29:19.1872360Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:29:19.1886770Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:19.1887136Z env: 2025-05-07T20:29:19.1887373Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:19.1887685Z BUILD_ENV: build_binary 2025-05-07T20:29:19.1887944Z BUILD_TARGET: genai 2025-05-07T20:29:19.1888188Z BUILD_VARIANT: cuda 2025-05-07T20:29:19.1888430Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:19.1888723Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:19.1889082Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:19.1889428Z ##[endgroup] 2025-05-07T20:29:19.5287342Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:29:19.5289439Z ################################################################################ 2025-05-07T20:29:19.5289958Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:29:19.5290351Z # 2025-05-07T20:29:19.5305224Z # [2025-05-07T20:29:19.530Z] + collect_pytorch_env_info build_binary 2025-05-07T20:29:19.5305645Z ################################################################################ 2025-05-07T20:29:19.5305876Z 2025-05-07T20:29:19.5322213Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:19.6259414Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:19.6270285Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:29:19.6271008Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:29:19.6271423Z 2025-05-07T20:29:19.7131240Z 2025-05-07T20:29:19.7132255Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:29:19.7156057Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:29:25.7713902Z Collecting environment information... 
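
[NOTE] The environment report that follows was produced by the collect_env.py script downloaded above. With torch already installed, the same report can typically be generated from the copy bundled with the package, without downloading anything:

    # Print the PyTorch environment report (torch/CUDA versions, GPU, CPU,
    # and related library versions) for use when filing issues.
    conda run -n build_binary python -m torch.utils.collect_env
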
2025-05-07T20:29:25.7714361Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:29:25.7714655Z Is debug build: False 2025-05-07T20:29:25.7714905Z CUDA used to build PyTorch: 12.6 2025-05-07T20:29:25.7715189Z ROCM used to build PyTorch: N/A 2025-05-07T20:29:25.7715366Z 2025-05-07T20:29:25.7715468Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:29:25.7715797Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:29:25.7716127Z Clang version: Could not collect 2025-05-07T20:29:25.7716398Z CMake version: Could not collect 2025-05-07T20:29:25.7716669Z Libc version: glibc-2.34 2025-05-07T20:29:25.7716831Z 2025-05-07T20:29:25.7717143Z Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:29:25.7717777Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:29:25.7718195Z Is CUDA available: True 2025-05-07T20:29:25.7718844Z CUDA runtime version: 12.6.85 2025-05-07T20:29:25.7719129Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:29:25.7719440Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:29:25.7719784Z Nvidia driver version: 570.133.07 2025-05-07T20:29:25.7720073Z cuDNN version: Could not collect 2025-05-07T20:29:25.7720348Z HIP runtime version: N/A 2025-05-07T20:29:25.7720606Z MIOpen runtime version: N/A 2025-05-07T20:29:25.7720875Z Is XNNPACK available: True 2025-05-07T20:29:25.7721038Z 2025-05-07T20:29:25.7721124Z CPU: 2025-05-07T20:29:25.7721339Z Architecture: x86_64 2025-05-07T20:29:25.7721685Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:29:25.7722094Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:29:25.7722489Z Byte Order: Little Endian 2025-05-07T20:29:25.7722824Z CPU(s): 16 2025-05-07T20:29:25.7723132Z On-line CPU(s) list: 0-15 2025-05-07T20:29:25.7723690Z Vendor ID: AuthenticAMD 2025-05-07T20:29:25.7724039Z Model name: AMD EPYC 7R32 2025-05-07T20:29:25.7724374Z CPU family: 23 2025-05-07T20:29:25.7724666Z Model: 49 2025-05-07T20:29:25.7724955Z Thread(s) per core: 2 2025-05-07T20:29:25.7725251Z Core(s) per socket: 8 2025-05-07T20:29:25.7725817Z Socket(s): 1 2025-05-07T20:29:25.7726099Z Stepping: 0 2025-05-07T20:29:25.7726402Z BogoMIPS: 5599.29 2025-05-07T20:29:25.7728612Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:29:25.7730833Z Hypervisor vendor: KVM 2025-05-07T20:29:25.7731152Z Virtualization type: full 2025-05-07T20:29:25.7731496Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:29:25.7731876Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:29:25.7732254Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:29:25.7732620Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:29:25.7732946Z NUMA node(s): 1 2025-05-07T20:29:25.7733250Z NUMA node0 CPU(s): 0-15 2025-05-07T20:29:25.7733592Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:29:25.7733974Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:29:25.7734351Z Vulnerability L1tf: Not affected 2025-05-07T20:29:25.7734822Z Vulnerability 
Mds: Not affected 2025-05-07T20:29:25.7735181Z Vulnerability Meltdown: Not affected 2025-05-07T20:29:25.7735551Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:29:25.7735927Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:29:25.7736487Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:29:25.7737079Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:29:25.7737636Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:29:25.7738345Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:29:25.7739229Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:29:25.7740083Z Vulnerability Srbds: Not affected 2025-05-07T20:29:25.7740462Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:29:25.7740700Z 2025-05-07T20:29:25.7740812Z Versions of relevant libraries: 2025-05-07T20:29:25.7741080Z [pip3] numpy==2.2.5 2025-05-07T20:29:25.7741331Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:29:25.7741646Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:29:25.7741964Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:29:25.7742288Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:29:25.7742627Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:29:25.7742927Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:29:25.7743222Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:29:25.7743532Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:29:25.7743851Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:29:25.7745063Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:29:25.7745391Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:29:25.7745685Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:29:25.7745992Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:29:25.7746292Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:29:25.7746607Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:29:25.7746984Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7747489Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7748026Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:25.7748573Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7749143Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:25.7749728Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:25.7750243Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7750720Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:29:25.7751224Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:29:25.7751738Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:29:25.7752233Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7752706Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:29:25.7753182Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7753652Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7754145Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7754636Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:29:25.7755114Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:29:25.7755597Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:29:25.7756066Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7756539Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:29:25.7757015Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7757501Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7757983Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:29:25.7758475Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:29:25.7758975Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7759566Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:29:25.7760066Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7760567Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:29:25.7761041Z [conda] numpy 2.2.5 py312h72c5963_0 conda-forge 2025-05-07T20:29:25.7761504Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:29:25.7762023Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:29:25.7762539Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:25.7763055Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:25.7763570Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:29:25.7764153Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:29:25.7764655Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:29:25.7765153Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:29:25.7765667Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:29:25.7766184Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:29:25.7766690Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:29:25.7767189Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:29:25.7767689Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:25.7768181Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:29:25.7768656Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:29:25.7768945Z 2025-05-07T20:29:25.8502914Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:25.8503591Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:25.8516287Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:25.8516645Z env: 2025-05-07T20:29:25.8516872Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:25.8517172Z BUILD_ENV: build_binary 2025-05-07T20:29:25.8517425Z BUILD_TARGET: genai 2025-05-07T20:29:25.8517659Z BUILD_VARIANT: cuda 2025-05-07T20:29:25.8517891Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:25.8518141Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:25.8518445Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:25.8518780Z ##[endgroup] 2025-05-07T20:29:26.1929420Z ################################################################################ 2025-05-07T20:29:26.1929784Z # Prepare FBGEMM-GPU Build 2025-05-07T20:29:26.1930047Z # 2025-05-07T20:29:26.1945096Z # [2025-05-07T20:29:26.194Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:29:26.1945531Z ################################################################################ 2025-05-07T20:29:26.1945756Z 2025-05-07T20:29:26.1960802Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:26.2836302Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:26.2858957Z [BUILD] Running git submodules update ... 2025-05-07T20:29:26.2881333Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:29:26.3247072Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:29:26.3247738Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:29:26.3248295Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:29:26.3248706Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:29:26.3249543Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:29:26.3250461Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:29:26.3251939Z Synchronizing submodule url for '../external/json' 2025-05-07T20:29:26.3285814Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:29:26.3839886Z [BUILD] Installing other build dependencies ... 
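
[NOTE] The dependency install that follows pulls FBGEMM's Python build requirements. A standalone equivalent of the retried command, assuming the working directory is fbgemm_gpu as in this job, is:

    # Install build-time dependencies (cmake, ninja, scikit-build, etc.)
    # listed in requirements.txt into the build environment.
    conda run --no-capture-output -n build_binary \
        python -m pip install -r requirements.txt
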
2025-05-07T20:29:26.3860717Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:29:28.8227272Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:29:28.8409107Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:29:28.9558825Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:29:28.9590259Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:29:29.1764478Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:29:29.1795734Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:29:29.3078434Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:29:29.3110468Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:29:29.6443770Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:29:29.6481607Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:29:29.7048921Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:29:29.7062335Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:29:29.7882306Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:29:29.7916841Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:29:29.8410328Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:29:29.8918635Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:29:29.8968322Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:29:29.9967885Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:29:29.9999260Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:29:30.0595691Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:29:30.0645584Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:29:30.1024230Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:29:30.1543787Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:29:30.1593506Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:29:30.2341279Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:29:30.2368880Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:29:30.3345544Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:29:30.3388984Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:29:30.4205339Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:29:30.4237544Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:29:30.4817996Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:29:30.4845582Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:29:30.5587807Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:30.5614809Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:30.6429170Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:30.6460768Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:30.6898395Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:29:30.7276123Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:30.7303982Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:29:30.7697566Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:29:30.8331972Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:29:30.8359701Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:29:30.8837588Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:29:30.9520898Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:30.9549032Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:29:31.0062888Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:29:31.0679424Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:29:31.1287444Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:29:31.8138086Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 40.9 MB/s eta 0:00:00 2025-05-07T20:29:31.8170201Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:29:31.8666036Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:29:31.9199044Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:29:31.9647748Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:29:32.0276661Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:29:32.0781737Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB) 2025-05-07T20:29:32.1429245Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 8.2 MB/s eta 0:00:00 2025-05-07T20:29:32.1479667Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:29:32.1982548Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:32.2482030Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:29:32.2966515Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:29:32.3561869Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:29:32.4071258Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:29:32.4561822Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:29:32.5044089Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:32.5544422Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:29:32.6029806Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:29:32.7704454Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:29:35.0440491Z 2025-05-07T20:29:35.0486403Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:29:35.2316330Z ################################################################################ 2025-05-07T20:29:35.2316731Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:29:35.2317003Z # 2025-05-07T20:29:35.2334936Z # [2025-05-07T20:29:35.233Z] + install_triton_pip build_binary 2025-05-07T20:29:35.2335340Z ################################################################################ 2025-05-07T20:29:35.2335561Z 2025-05-07T20:29:35.2335797Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:29:35.2336249Z ################################################################################ 2025-05-07T20:29:35.2336630Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:29:35.2336966Z # 2025-05-07T20:29:35.2352029Z # [2025-05-07T20:29:35.234Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:35.2352579Z ################################################################################ 2025-05-07T20:29:35.2352808Z 2025-05-07T20:29:35.2372259Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:35.3239872Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:35.3240325Z ################################################################################ 2025-05-07T20:29:35.3240673Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:29:35.3240970Z # 2025-05-07T20:29:35.3260206Z # [2025-05-07T20:29:35.325Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:35.3260714Z ################################################################################ 2025-05-07T20:29:35.3260939Z 2025-05-07T20:29:35.3309254Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:29:35.3325704Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:29:35.3326264Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:35.3334929Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:35.3361570Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:29:35.3382761Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:42.8809239Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:42.8810927Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:42.8811724Z 2025-05-07T20:29:42.8811953Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:42.8812386Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:42.8813222Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:42.8814500Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:42.8815790Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 59.2 MB/s eta 0:00:00 2025-05-07T20:29:42.8816186Z Installing collected packages: pytorch-triton 2025-05-07T20:29:42.8816539Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:42.8816936Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:42.8817372Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:42.8818870Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:42.8819324Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:42.8819604Z 2025-05-07T20:29:45.1334457Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:45.1338343Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:47.3080231Z ################################################################################ 2025-05-07T20:29:47.3080708Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:47.3081105Z ################################################################################ 2025-05-07T20:29:47.3081330Z 2025-05-07T20:29:49.3990388Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:51.5928999Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:51.5933199Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:51.5970662Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:51.5971187Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:51.5983926Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:51.5984281Z env: 2025-05-07T20:29:51.5984512Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:51.5984812Z BUILD_ENV: build_binary 2025-05-07T20:29:51.5985058Z BUILD_TARGET: genai 2025-05-07T20:29:51.5985291Z BUILD_VARIANT: cuda 2025-05-07T20:29:51.5985527Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:51.5985778Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:51.5986078Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:51.5986415Z ##[endgroup] 2025-05-07T20:29:51.9377840Z ################################################################################ 2025-05-07T20:29:51.9378246Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:51.9378525Z # 2025-05-07T20:29:51.9394312Z # [2025-05-07T20:29:51.939Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9394987Z ################################################################################ 2025-05-07T20:29:51.9395213Z 2025-05-07T20:29:51.9395590Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9396362Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9396717Z 2025-05-07T20:29:51.9515750Z b58dd3e4c726c265422746de0dfe912f1de4c20c fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9518593Z 2025-05-07T20:29:51.9519148Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9519530Z 2025-05-07T20:29:51.9649502Z e43258215d51ee2f91c736eb424ad291b450bb2c2463b8d99c2ae36a64a4ffa7 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9650804Z 2025-05-07T20:29:51.9651151Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9651530Z 2025-05-07T20:29:51.9885860Z 616cc1b2508efed22f2eda95309a712f fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9887674Z 2025-05-07T20:29:51.9897480Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:51.9919341Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:54.6861201Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:54.6862200Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:54.6863103Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:54.6863559Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:54.6864179Z 2025-05-07T20:30:01.7024820Z ################################################################################ 2025-05-07T20:30:01.7025228Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:30:01.7024820Z ################################################################################
2025-05-07T20:30:01.7025228Z [CHECK] !!!! INFO !!!!
2025-05-07T20:30:01.7025856Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126
2025-05-07T20:30:01.7026301Z [CHECK] CUDA version reported by PyTorch is: 12.6
2025-05-07T20:30:01.7026635Z [CHECK]
2025-05-07T20:30:01.7026979Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:30:01.7027492Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:30:01.7027896Z ################################################################################
2025-05-07T20:30:01.7028239Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:30:05.7583207Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:09.7974550Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:30:13.8271569Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:30:13.8276346Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:30:25.9001684Z ################################################################################
2025-05-07T20:30:25.9005093Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:30:25.9005480Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:30:25.9005845Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:30:25.9006204Z ################################################################################
2025-05-07T20:30:33.9774976Z ################################################################################
2025-05-07T20:30:33.9775716Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:30:33.9777593Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:30:33.9779229Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:30:33.9779775Z ################################################################################
2025-05-07T20:30:33.9780170Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:30:38.0356945Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:30:42.0902587Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:30:46.2409266Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:30:50.2797485Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
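A minimal sketch of the import-and-symbol probe these [CHECK] lines perform; the attribute names come from the dir() listing printed above (the exact probe in setup_env.bash is bash, not shown in this log):

    import fbgemm_gpu

    # Attribute names taken from the dir() listing above.
    for symbol in ("__version__", "__variant__", "__target__"):
        value = getattr(fbgemm_gpu, symbol, None)
        print(f"fbgemm_gpu.{symbol} = {value!r}")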
2025-05-07T20:30:50.2802521Z [INSTALL] Check for operator registrations ...
2025-05-07T20:30:54.2612897Z fbgemm.nccl_init
2025-05-07T20:30:54.3252338Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init
2025-05-07T20:30:58.3139930Z fbgemm.gqa_attn_splitk
2025-05-07T20:30:58.3823991Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk
2025-05-07T20:31:02.3342479Z fbgemm.rope_qkv_decoding
2025-05-07T20:31:02.3996843Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding
2025-05-07T20:31:02.3997687Z [INSTALL] FBGEMM-GPU installation through wheel completed ...
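A sketch of an equivalent registration check from Python, assuming only that importing fbgemm_gpu loads the extension libraries that register the operators (operator names are the ones the [CHECK] lines report; an unregistered name raises AttributeError on lookup):

    import torch
    import fbgemm_gpu  # noqa: F401 -- importing loads the extension libraries

    # Looking up a name that was never registered raises AttributeError,
    # which serves as the failure signal here.
    for op_name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        getattr(torch.ops.fbgemm, op_name)
        print(f"registered: torch.ops.fbgemm.{op_name}")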
2025-05-07T20:31:02.4033134Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:31:02.4033603Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:31:02.4046896Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:31:02.4047471Z env:
2025-05-07T20:31:02.4047704Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:31:02.4048019Z BUILD_ENV: build_binary
2025-05-07T20:31:02.4048277Z BUILD_TARGET: genai
2025-05-07T20:31:02.4048519Z BUILD_VARIANT: cuda
2025-05-07T20:31:02.4048761Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:31:02.4049037Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:31:02.4049350Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:31:02.4049691Z ##[endgroup]
2025-05-07T20:31:02.7463126Z ################################################################################
2025-05-07T20:31:02.7463676Z # Test All FBGEMM-GPU Modules
2025-05-07T20:31:02.7463946Z #
2025-05-07T20:31:02.7471492Z # [2025-05-07T20:31:02.746Z] + test_all_fbgemm_gpu_modules build_binary
2025-05-07T20:31:02.7471941Z ################################################################################
2025-05-07T20:31:10.8133719Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda)
2025-05-07T20:31:10.8134737Z [TEST] Will be running tests specific to this target and variant ...
2025-05-07T20:31:10.8135285Z [TEST] Determined the test directories:
2025-05-07T20:31:10.8135614Z fbgemm_gpu/experimental/gen_ai/test
2025-05-07T20:31:10.8135924Z fbgemm_gpu/experimental/example/test
2025-05-07T20:31:10.8136236Z fbgemm_gpu/experimental/gemm/test
2025-05-07T20:31:10.8144147Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ...
2025-05-07T20:31:10.8151377Z [TEST] Set environment variables for CUDA testing ...
2025-05-07T20:31:10.8152020Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES
2025-05-07T20:31:11.2465000Z [TEST] Installing PyTest ...
2025-05-07T20:31:11.2488273Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:31:12.3506392Z Channels:
2025-05-07T20:31:12.3506774Z - conda-forge
2025-05-07T20:31:12.3507087Z Platform: linux-64
2025-05-07T20:31:15.7437143Z Collecting package metadata (repodata.json): done
2025-05-07T20:31:16.9076739Z Solving environment: done
2025-05-07T20:31:17.1356992Z ## Package Plan ##
2025-05-07T20:31:17.1357402Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:31:17.1357829Z added / updated specs:
2025-05-07T20:31:17.1358102Z - expecttest
2025-05-07T20:31:17.1358332Z - pytest
2025-05-07T20:31:17.1358609Z The following packages will be downloaded:
2025-05-07T20:31:17.1358986Z     package                    |            build
2025-05-07T20:31:17.1359336Z     ---------------------------|-----------------
2025-05-07T20:31:17.1359735Z     colorama-0.4.6             |   pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:31:17.1360239Z     exceptiongroup-1.2.2       |   pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:31:17.1360729Z     expecttest-0.3.0           |   pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:31:17.1361185Z     iniconfig-2.0.0            |   pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:31:17.1361644Z     packaging-25.0             |   pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:31:17.1362093Z     pluggy-1.5.0               |   pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:31:17.1363939Z     pytest-8.3.5               |   pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:31:17.1364658Z     tomli-2.2.1                |   pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:31:17.1365075Z     ------------------------------------------------------------
2025-05-07T20:31:17.1365445Z                                            Total:         428 KB
2025-05-07T20:31:17.1365803Z The following NEW packages will be INSTALLED:
2025-05-07T20:31:17.1366404Z colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:31:17.1366947Z exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:31:17.1367497Z expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:31:17.1367996Z iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:31:17.1368540Z packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:31:17.1369020Z pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:31:17.1369486Z pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:31:17.1369930Z tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:31:17.1370369Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:31:17.4241452Z colorama-0.4.6, exceptiongroup-1.2.2, expecttest-0.3.0, iniconfig-2.0.0, packaging-25.0, pluggy-1.5.0, pytest-8.3.5, tomli-2.2.1 | ########## | 100%
2025-05-07T20:31:17.5245571Z Preparing transaction: done
2025-05-07T20:31:17.6248247Z Verifying transaction: done
2025-05-07T20:31:19.5277382Z Executing transaction: done
2025-05-07T20:31:19.6701329Z [TEST] Checking imports ...
2025-05-07T20:31:23.6781046Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:31:23.6794026Z [TEST] Setting feature flags ...
2025-05-07T20:31:23.6794661Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1
2025-05-07T20:31:24.1079994Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning
2025-05-07T20:31:24.1080821Z ################################################################################
2025-05-07T20:31:24.1081268Z # Run FBGEMM-GPU Tests:
2025-05-07T20:31:24.1081601Z #
2025-05-07T20:31:24.1100861Z # [2025-05-07T20:31:24.109Z] + __run_fbgemm_gpu_tests_in_directory build_binary
2025-05-07T20:31:24.1101464Z ################################################################################
2025-05-07T20:31:24.1108754Z [TEST] Enumerating ALL test files ...
2025-05-07T20:31:24.1138792Z ./attention/gqa_test.py
2025-05-07T20:31:24.1139175Z ./coalesce/coalesce_test.py
2025-05-07T20:31:24.1139557Z ./comm/multi_gpu_car_test.py
2025-05-07T20:31:24.1139933Z ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:24.1140304Z ./kv_cache/kv_cache_test.py
2025-05-07T20:31:24.1140566Z ./moe/activation_test.py
2025-05-07T20:31:24.1140813Z ./moe/gather_scatter_test.py
2025-05-07T20:31:24.1141072Z ./moe/layers_test.py
2025-05-07T20:31:24.1141308Z ./moe/shuffling_test.py
2025-05-07T20:31:24.1141551Z ./quantize/quantize_test.py
2025-05-07T20:31:24.1141846Z [TEST] Enumerating IGNORED test files ...
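Each suite below is launched with the flags from the "[TEST] PyTest args:" line plus --cache-clear. A minimal sketch of the same invocation through pytest's Python entry point (the runner script itself drives this through conda run and bash):

    import sys
    import pytest

    # Flags copied from the "[TEST] PyTest args:" line and the command echoed below.
    args = [
        "-v", "-rsx", "-s",
        "-W", "ignore::pytest.PytestCollectionWarning",
        "--cache-clear",
        "./attention/gqa_test.py",
    ]
    sys.exit(pytest.main(args))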
2025-05-07T20:31:24.1159608Z ################################################################################
2025-05-07T20:31:24.1174811Z # [2025-05-07T20:31:24.117Z] Run Python Test Suite:
2025-05-07T20:31:24.1175272Z # ./attention/gqa_test.py
2025-05-07T20:31:24.1175659Z ################################################################################
2025-05-07T20:31:24.1198919Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py
2025-05-07T20:31:26.6611728Z ============================= test session starts ==============================
2025-05-07T20:31:26.6612604Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:26.6613211Z cachedir: .pytest_cache
2025-05-07T20:31:26.6613808Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:26.6614899Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:26.6615335Z plugins: hypothesis-6.131.14
2025-05-07T20:31:28.3345804Z collecting ... collected 2 items
2025-05-07T20:32:05.7728120Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
2025-05-07T20:32:05.7731930Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
(Hypothesis went on to try several dozen more combinations of int4_kv, num_groups, B, MAX_T, and N_H_L)
2025-05-07T20:32:05.7821573Z PASSED
2025-05-07T20:32:05.7931567Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
2025-05-07T20:32:05.7932096Z =========================== short test summary info ============================
2025-05-07T20:32:05.7932853Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when CUDA is not available or xformers is not available
2025-05-07T20:32:05.7933598Z ======================== 1 passed, 1 skipped in 39.63s =========================
2025-05-07T20:32:06.4650172Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:32:06.4670134Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds
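The session headers above report a Hypothesis profile named 'ci'. A sketch of registering that profile with the exact values the header prints (where the real codebase registers it is not shown in this log):

    from hypothesis import HealthCheck, settings

    # Values copied from the "hypothesis profile 'ci'" line in the session header.
    settings.register_profile(
        "ci",
        database=None,
        deadline=None,
        print_blob=True,
        derandomize=True,
        suppress_health_check=[HealthCheck.too_slow],
    )
    settings.load_profile("ci")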
2025-05-07T20:32:06.4692908Z ################################################################################
2025-05-07T20:32:06.4709110Z # [2025-05-07T20:32:06.470Z] Run Python Test Suite:
2025-05-07T20:32:06.4709453Z # ./coalesce/coalesce_test.py
2025-05-07T20:32:06.4709755Z ################################################################################
2025-05-07T20:32:06.4733994Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:32:08.6441358Z ============================= test session starts ==============================
2025-05-07T20:32:08.6442070Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:08.6442633Z cachedir: .pytest_cache
2025-05-07T20:32:08.6443314Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:08.6444193Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:08.6444671Z plugins: hypothesis-6.131.14
2025-05-07T20:32:10.3877384Z collecting ... collected 1 item
2025-05-07T20:32:11.1683453Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:32:11.1683974Z ============================== 1 passed in 2.65s ===============================
2025-05-07T20:32:11.8202103Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:32:11.8222678Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:32:11.8245730Z ################################################################################
2025-05-07T20:32:11.8261098Z # [2025-05-07T20:32:11.825Z] Run Python Test Suite:
2025-05-07T20:32:11.8261462Z # ./comm/multi_gpu_car_test.py
2025-05-07T20:32:11.8261759Z ################################################################################
2025-05-07T20:32:11.8287505Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:32:14.0201812Z ============================= test session starts ==============================
2025-05-07T20:32:14.0202878Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:14.0203799Z cachedir: .pytest_cache
2025-05-07T20:32:14.0204782Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:14.0206061Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:14.0206802Z plugins: hypothesis-6.131.14
2025-05-07T20:32:15.7376329Z collecting ... collected 5 items
2025-05-07T20:32:15.7390566Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:32:15.7400248Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:32:15.7408955Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:32:15.7417716Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:32:15.7437919Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:32:15.7438776Z =========================== short test summary info ============================
2025-05-07T20:32:15.7439506Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7440499Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7441671Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7442659Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7443644Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7444338Z ============================== 5 skipped in 1.85s ==============================
2025-05-07T20:32:16.3229082Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:32:16.3248371Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds
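All five tests skip because this g5.4xlarge runner has a single GPU. A sketch of the kind of two-device gate behind those skip messages, using the skip text from the summary above (the decorator placement is illustrative, not the test file's exact code):

    import unittest
    import torch

    @unittest.skipIf(
        not torch.cuda.is_available() or torch.cuda.device_count() < 2,
        "Skip when CUDA is not available or when there are not enough GPUs; "
        "these tests require at least two GPUs",
    )
    class CarLikeMultiGpuTest(unittest.TestCase):
        def test_smoke(self) -> None:
            # Only reached on hosts with two or more visible CUDA devices.
            self.assertGreaterEqual(torch.cuda.device_count(), 2)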
2025-05-07T20:32:16.3271132Z ################################################################################
2025-05-07T20:32:16.3286732Z # [2025-05-07T20:32:16.328Z] Run Python Test Suite:
2025-05-07T20:32:16.3287236Z # ./gather_scatter/gather_scatter_test.py
2025-05-07T20:32:16.3287675Z ################################################################################
2025-05-07T20:32:16.3311419Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:32:18.5011851Z ============================= test session starts ==============================
2025-05-07T20:32:18.5012589Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:18.5013139Z cachedir: .pytest_cache
2025-05-07T20:32:18.5013754Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:18.5014598Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:18.5015018Z plugins: hypothesis-6.131.14
2025-05-07T20:32:20.3092863Z collecting ... collected 2 items
2025-05-07T20:32:20.3104464Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:32:20.3121497Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:32:20.3122330Z =========================== short test summary info ============================
2025-05-07T20:32:20.3122983Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:32:20.3123854Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:32:20.3124488Z ============================== 2 skipped in 1.93s ==============================
2025-05-07T20:32:20.9176254Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:32:20.9198292Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds
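A sketch of a Hopper-only gate like the one behind these skips. The predicate is an assumption about how such a check is typically written (Hopper/H100 reports compute capability (9, 0)), not the test file's actual code:

    import unittest
    import torch

    def has_hopper_gpu() -> bool:
        # Hopper (H100) parts report compute capability (9, 0).
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability()
        return major == 9

    @unittest.skipUnless(
        has_hopper_gpu(),
        "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
    )
    class GatherScatterLikeTest(unittest.TestCase):
        def test_runs_on_hopper_only(self) -> None:
            self.assertTrue(has_hopper_gpu())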
2025-05-07T20:32:20.9221598Z ################################################################################
2025-05-07T20:32:20.9237989Z # [2025-05-07T20:32:20.923Z] Run Python Test Suite:
2025-05-07T20:32:20.9238914Z # ./kv_cache/kv_cache_test.py
2025-05-07T20:32:20.9239291Z ################################################################################
2025-05-07T20:32:20.9263739Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:32:23.0935959Z ============================= test session starts ==============================
2025-05-07T20:32:23.0936712Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:23.0937244Z cachedir: .pytest_cache
2025-05-07T20:32:23.0937838Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:23.0938594Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:23.0939017Z plugins: hypothesis-6.131.14
2025-05-07T20:32:24.7942484Z collecting ... collected 4 items
2025-05-07T20:32:27.6271323Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
2025-05-07T20:32:27.6356825Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:32:27.6452311Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:32:27.6542058Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:32:27.6542610Z =========================== short test summary info ============================
2025-05-07T20:32:27.6543353Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when H100 is not available or MI300 is not available
2025-05-07T20:32:27.6544317Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when xformers is not available
2025-05-07T20:32:27.6544987Z ============================== 4 skipped in 4.68s ==============================
2025-05-07T20:32:29.6340606Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:32:29.6362224Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds
2025-05-07T20:32:29.6386153Z ################################################################################
2025-05-07T20:32:29.6401891Z # [2025-05-07T20:32:29.639Z] Run Python Test Suite:
2025-05-07T20:32:29.6402231Z # ./moe/activation_test.py
2025-05-07T20:32:29.6402710Z ################################################################################
2025-05-07T20:32:29.6427669Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:32:31.8245248Z ============================= test session starts ==============================
2025-05-07T20:32:31.8245893Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:31.8246432Z cachedir: .pytest_cache
2025-05-07T20:32:31.8247034Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:31.8247800Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:31.8248222Z plugins: hypothesis-6.131.14
2025-05-07T20:32:33.5044203Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
collected 2 items 2025-05-07T20:32:33.6129957Z 2025-05-07T20:32:39.1199672Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:32:39.1201046Z self=, 2025-05-07T20:32:39.1201965Z T=1, 2025-05-07T20:32:39.1202160Z D=5120, 2025-05-07T20:32:39.1202364Z contiguous=True, 2025-05-07T20:32:39.1202601Z compiled=True, 2025-05-07T20:32:39.1202804Z ) 2025-05-07T20:32:39.1203006Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1203594Z self=, 2025-05-07T20:32:39.1203984Z T=4096, 2025-05-07T20:32:39.1204185Z D=5120, 2025-05-07T20:32:39.1204384Z contiguous=True, 2025-05-07T20:32:39.1204607Z compiled=True, 2025-05-07T20:32:39.1204817Z ) 2025-05-07T20:32:39.1205017Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1205396Z self=, 2025-05-07T20:32:39.1205799Z T=4096, 2025-05-07T20:32:39.1205994Z D=7168, 2025-05-07T20:32:39.1206187Z contiguous=False, 2025-05-07T20:32:39.1206416Z compiled=False, 2025-05-07T20:32:39.1206627Z ) 2025-05-07T20:32:39.1206815Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1207206Z self=, 2025-05-07T20:32:39.1207605Z T=4096, 2025-05-07T20:32:39.1207796Z D=5120, 2025-05-07T20:32:39.1207989Z contiguous=False, 2025-05-07T20:32:39.1208221Z compiled=True, 2025-05-07T20:32:39.1208430Z ) 2025-05-07T20:32:39.1208624Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1209005Z self=, 2025-05-07T20:32:39.1209395Z T=1, 2025-05-07T20:32:39.1209575Z D=7168, 2025-05-07T20:32:39.1209777Z contiguous=True, 2025-05-07T20:32:39.1210009Z compiled=True, 2025-05-07T20:32:39.1210207Z ) 2025-05-07T20:32:39.1210408Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1210791Z self=, 2025-05-07T20:32:39.1211181Z T=1, 2025-05-07T20:32:39.1211374Z D=7168, 2025-05-07T20:32:39.1211580Z contiguous=False, 2025-05-07T20:32:39.1211803Z compiled=True, 2025-05-07T20:32:39.1212024Z ) 2025-05-07T20:32:39.1212234Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1212613Z self=, 2025-05-07T20:32:39.1213011Z T=4096, 2025-05-07T20:32:39.1213214Z D=5120, 2025-05-07T20:32:39.1213416Z contiguous=False, 2025-05-07T20:32:39.1213653Z compiled=False, 2025-05-07T20:32:39.1213872Z ) 2025-05-07T20:32:39.1214072Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1214597Z self=, 2025-05-07T20:32:39.1215007Z T=1, 2025-05-07T20:32:39.1215212Z D=7168, 2025-05-07T20:32:39.1215414Z contiguous=True, 2025-05-07T20:32:39.1215660Z compiled=False, 2025-05-07T20:32:39.1215884Z ) 2025-05-07T20:32:39.1216089Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1216483Z self=, 2025-05-07T20:32:39.1216892Z T=2048, 2025-05-07T20:32:39.1217091Z D=5120, 2025-05-07T20:32:39.1217308Z contiguous=True, 2025-05-07T20:32:39.1217547Z compiled=True, 2025-05-07T20:32:39.1217750Z ) 2025-05-07T20:32:39.1217961Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1218349Z self=, 2025-05-07T20:32:39.1218748Z T=2048, 2025-05-07T20:32:39.1218944Z D=7168, 2025-05-07T20:32:39.1219143Z contiguous=True, 2025-05-07T20:32:39.1219364Z compiled=True, 2025-05-07T20:32:39.1219574Z ) 2025-05-07T20:32:39.1219781Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1220167Z self=, 2025-05-07T20:32:39.1220573Z T=2048, 2025-05-07T20:32:39.1220774Z D=7168, 2025-05-07T20:32:39.1220983Z contiguous=True, 2025-05-07T20:32:39.1221216Z compiled=False, 2025-05-07T20:32:39.1221435Z ) 2025-05-07T20:32:39.1221649Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1222142Z self=, 2025-05-07T20:32:39.1222646Z T=128, 2025-05-07T20:32:39.1222847Z D=5120, 2025-05-07T20:32:39.1223045Z contiguous=False, 2025-05-07T20:32:39.1223282Z 
compiled=True, 2025-05-07T20:32:39.1223503Z ) 2025-05-07T20:32:39.1223701Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1224184Z self=, 2025-05-07T20:32:39.1224596Z T=128, 2025-05-07T20:32:39.1224786Z D=5120, 2025-05-07T20:32:39.1225000Z contiguous=True, 2025-05-07T20:32:39.1225240Z compiled=True, 2025-05-07T20:32:39.1225639Z ) 2025-05-07T20:32:39.1225860Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1226262Z self=, 2025-05-07T20:32:39.1226659Z T=16384, 2025-05-07T20:32:39.1226872Z D=5120, 2025-05-07T20:32:39.1227087Z contiguous=False, 2025-05-07T20:32:39.1227328Z compiled=True, 2025-05-07T20:32:39.1227537Z ) 2025-05-07T20:32:39.1227748Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1228164Z self=, 2025-05-07T20:32:39.1228572Z T=16384, 2025-05-07T20:32:39.1228784Z D=5120, 2025-05-07T20:32:39.1228987Z contiguous=False, 2025-05-07T20:32:39.1229228Z compiled=False, 2025-05-07T20:32:39.1229515Z ) 2025-05-07T20:32:39.1229781Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1230278Z self=, 2025-05-07T20:32:39.1230724Z T=128, 2025-05-07T20:32:39.1231177Z D=7168, 2025-05-07T20:32:39.1241419Z contiguous=True, 2025-05-07T20:32:39.1241706Z compiled=False, 2025-05-07T20:32:39.1241926Z ) 2025-05-07T20:32:39.1242150Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1242557Z self=, 2025-05-07T20:32:39.1242971Z T=128, 2025-05-07T20:32:39.1243195Z D=7168, 2025-05-07T20:32:39.1243417Z contiguous=False, 2025-05-07T20:32:39.1243657Z compiled=False, 2025-05-07T20:32:39.1243906Z ) 2025-05-07T20:32:39.1244134Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1244524Z self=, 2025-05-07T20:32:39.1244945Z T=1, 2025-05-07T20:32:39.1245153Z D=5120, 2025-05-07T20:32:39.1245367Z contiguous=False, 2025-05-07T20:32:39.1245617Z compiled=False, 2025-05-07T20:32:39.1245848Z ) 2025-05-07T20:32:39.1246053Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1246455Z self=, 2025-05-07T20:32:39.1246869Z T=1, 2025-05-07T20:32:39.1247080Z D=7168, 2025-05-07T20:32:39.1247288Z contiguous=False, 2025-05-07T20:32:39.1247539Z compiled=False, 2025-05-07T20:32:39.1247768Z ) 2025-05-07T20:32:39.1247974Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1248386Z self=, 2025-05-07T20:32:39.1248795Z T=4096, 2025-05-07T20:32:39.1248992Z D=5120, 2025-05-07T20:32:39.1249215Z contiguous=True, 2025-05-07T20:32:39.1249456Z compiled=False, 2025-05-07T20:32:39.1249665Z ) 2025-05-07T20:32:39.1249874Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1250270Z self=, 2025-05-07T20:32:39.1250672Z T=128, 2025-05-07T20:32:39.1250872Z D=7168, 2025-05-07T20:32:39.1251083Z contiguous=True, 2025-05-07T20:32:39.1251307Z compiled=True, 2025-05-07T20:32:39.1251524Z ) 2025-05-07T20:32:39.1251733Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1252173Z self=, 2025-05-07T20:32:39.1252583Z T=1, 2025-05-07T20:32:39.1252777Z D=5120, 2025-05-07T20:32:39.1252977Z contiguous=False, 2025-05-07T20:32:39.1253213Z compiled=True, 2025-05-07T20:32:39.1253429Z ) 2025-05-07T20:32:39.1253632Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1254020Z self=, 2025-05-07T20:32:39.1254699Z T=4096, 2025-05-07T20:32:39.1254905Z D=7168, 2025-05-07T20:32:39.1255108Z contiguous=True, 2025-05-07T20:32:39.1255343Z compiled=False, 2025-05-07T20:32:39.1255564Z ) 2025-05-07T20:32:39.1255768Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1256286Z self=, 2025-05-07T20:32:39.1256686Z T=4096, 2025-05-07T20:32:39.1256881Z D=7168, 2025-05-07T20:32:39.1257088Z contiguous=False, 2025-05-07T20:32:39.1257326Z compiled=True, 2025-05-07T20:32:39.1257535Z ) 
2025-05-07T20:32:39.1257745Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1258137Z self=, 2025-05-07T20:32:39.1258533Z T=128, 2025-05-07T20:32:39.1258739Z D=5120, 2025-05-07T20:32:39.1258951Z contiguous=True, 2025-05-07T20:32:39.1259181Z compiled=False, 2025-05-07T20:32:39.1259404Z ) 2025-05-07T20:32:39.1259616Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1260019Z self=, 2025-05-07T20:32:39.1260412Z T=128, 2025-05-07T20:32:39.1260613Z D=5120, 2025-05-07T20:32:39.1260852Z contiguous=False, 2025-05-07T20:32:39.1261105Z compiled=False, 2025-05-07T20:32:39.1261331Z ) 2025-05-07T20:32:39.1261539Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1261959Z self=, 2025-05-07T20:32:39.1262356Z T=1, 2025-05-07T20:32:39.1262553Z D=5120, 2025-05-07T20:32:39.1262767Z contiguous=True, 2025-05-07T20:32:39.1262998Z compiled=False, 2025-05-07T20:32:39.1263220Z ) 2025-05-07T20:32:39.1263429Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1263816Z self=, 2025-05-07T20:32:39.1264221Z T=2048, 2025-05-07T20:32:39.1264419Z D=7168, 2025-05-07T20:32:39.1264625Z contiguous=False, 2025-05-07T20:32:39.1264860Z compiled=True, 2025-05-07T20:32:39.1265081Z ) 2025-05-07T20:32:39.1265279Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1265671Z self=, 2025-05-07T20:32:39.1266070Z T=2048, 2025-05-07T20:32:39.1266266Z D=7168, 2025-05-07T20:32:39.1266482Z contiguous=False, 2025-05-07T20:32:39.1266720Z compiled=False, 2025-05-07T20:32:39.1266934Z ) 2025-05-07T20:32:39.1267141Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1267534Z self=, 2025-05-07T20:32:39.1267936Z T=16384, 2025-05-07T20:32:39.1268147Z D=7168, 2025-05-07T20:32:39.1268365Z contiguous=False, 2025-05-07T20:32:39.1268613Z compiled=True, 2025-05-07T20:32:39.1268828Z ) 2025-05-07T20:32:39.1269052Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1269452Z self=, 2025-05-07T20:32:39.1269857Z T=16384, 2025-05-07T20:32:39.1270071Z D=7168, 2025-05-07T20:32:39.1270293Z contiguous=True, 2025-05-07T20:32:39.1270524Z compiled=True, 2025-05-07T20:32:39.1270758Z ) 2025-05-07T20:32:39.1270979Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1271364Z self=, 2025-05-07T20:32:39.1271789Z T=4096, 2025-05-07T20:32:39.1271999Z D=7168, 2025-05-07T20:32:39.1272200Z contiguous=True, 2025-05-07T20:32:39.1272448Z compiled=True, 2025-05-07T20:32:39.1272674Z ) 2025-05-07T20:32:39.1272877Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1273276Z self=, 2025-05-07T20:32:39.1273694Z T=2048, 2025-05-07T20:32:39.1273905Z D=5120, 2025-05-07T20:32:39.1274109Z contiguous=False, 2025-05-07T20:32:39.1274361Z compiled=False, 2025-05-07T20:32:39.1274580Z ) 2025-05-07T20:32:39.1274782Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1275268Z self=, 2025-05-07T20:32:39.1275669Z T=2048, 2025-05-07T20:32:39.1275862Z D=5120, 2025-05-07T20:32:39.1276069Z contiguous=True, 2025-05-07T20:32:39.1276301Z compiled=False, 2025-05-07T20:32:39.1276513Z ) 2025-05-07T20:32:39.1276719Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1277193Z self=, 2025-05-07T20:32:39.1277582Z T=128, 2025-05-07T20:32:39.1277785Z D=7168, 2025-05-07T20:32:39.1277990Z contiguous=False, 2025-05-07T20:32:39.1278216Z compiled=True, 2025-05-07T20:32:39.1278435Z ) 2025-05-07T20:32:39.1278639Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1279021Z self=, 2025-05-07T20:32:39.1279422Z T=16384, 2025-05-07T20:32:39.1279634Z D=5120, 2025-05-07T20:32:39.1279833Z contiguous=True, 2025-05-07T20:32:39.1280063Z compiled=True, 2025-05-07T20:32:39.1280283Z ) 2025-05-07T20:32:39.1280490Z Trying example: 
test_silu_mul( 2025-05-07T20:32:39.1280891Z self=, 2025-05-07T20:32:39.1281309Z T=2048, 2025-05-07T20:32:39.1281517Z D=5120, 2025-05-07T20:32:39.1281725Z contiguous=False, 2025-05-07T20:32:39.1281981Z compiled=True, 2025-05-07T20:32:39.1282212Z ) 2025-05-07T20:32:39.1282420Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1282824Z self=, 2025-05-07T20:32:39.1283233Z T=16384, 2025-05-07T20:32:39.1283437Z D=5120, 2025-05-07T20:32:39.1283657Z contiguous=True, 2025-05-07T20:32:39.1283902Z compiled=False, 2025-05-07T20:32:39.1284115Z ) 2025-05-07T20:32:39.1284332Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1284734Z self=, 2025-05-07T20:32:39.1285135Z T=16384, 2025-05-07T20:32:39.1285354Z D=7168, 2025-05-07T20:32:39.1285577Z contiguous=False, 2025-05-07T20:32:39.1285818Z compiled=False, 2025-05-07T20:32:39.1286053Z ) 2025-05-07T20:32:39.1286277Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1286663Z self=, 2025-05-07T20:32:39.1287074Z T=16384, 2025-05-07T20:32:39.1287296Z D=7168, 2025-05-07T20:32:39.1287510Z contiguous=True, 2025-05-07T20:32:39.1287735Z compiled=False, 2025-05-07T20:32:39.1287952Z ) 2025-05-07T20:32:39.1288148Z PASSED 2025-05-07T20:32:39.1889125Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:39.1890416Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:32:39.1893288Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:39.1896520Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:39.1898573Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:39.1901016Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:39.1902476Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.1903685Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:39.1904991Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:39.1906581Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.1907712Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
2025-05-07T20:32:39.1889125Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
    return visitor(node)
           ^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.2051211Z W0507 20:32:39.203000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:39.2445819Z W0507 20:32:39.243000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:39.2488511Z W0507 20:32:39.247000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:39.6753776Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295394540>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
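The reference path that fails here is triton_quantize_fp8_row: row-wise dynamic quantization to FP8, returning a per-row scale such that y ≈ y_fp8 * scale[:, None] (exactly how the test dequantizes). A rough pure-PyTorch sketch of those semantics, under stated assumptions — the scale_ub handling and the epsilon are guesses, and torch.float8_e4m3fn needs a recent PyTorch:

    # Rough sketch of row-wise FP8 quantization semantics (assumed, not
    # FBGEMM's kernel): one scale per row so that y ~= y_fp8 * scale[:, None].
    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            # Assumption: scale_ub caps the per-row max to bound the scale.
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], which is why the scale itself (not its reciprocal) is returned.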
2025-05-07T20:32:39.6805350Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:39.9817677Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:39.9836687Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:39.9837558Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:39.9838642Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:39.9839715Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:39.9840562Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:39.9841856Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:39.9843224Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:39.9844400Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:39.9845510Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:39.9846771Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:39.9848211Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:39.9849343Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9850298Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9851133Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:39.9852295Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.0636048Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.0637593Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:40.0639001Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.0640505Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.0641547Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.0642929Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.0644399Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.0645439Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.0646734Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.0648195Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.0649319Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.0650678Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.0652005Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:40.0653295Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.0654692Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:40.0655576Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.0656658Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:40.0657725Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:40.0658570Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:40.0659984Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.0661349Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.0662644Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:40.0663739Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:40.0664990Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.0666424Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.0667549Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.0668515Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.0669290Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:40.0670368Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.2991988Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.2993289Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:40.2994733Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.2996240Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.2997270Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.2998650Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.3000129Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3001169Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3002476Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.3004115Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3005243Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3006711Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.3008039Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:40.3009339Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.3010661Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:40.3011540Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3012630Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:40.3013710Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:40.3014667Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:40.3015956Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.3017306Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.3018488Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:40.3019590Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:40.3020833Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.3022274Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.3023385Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3024347Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3025133Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:40.3026385Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3095511Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.3097102Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:40.3098529Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.3100136Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.3101164Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3102553Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.3104008Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3105052Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3106351Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.3107807Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3108931Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3110278Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.3111609Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:40.3112900Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.3114181Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:40.3115057Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3116132Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:40.3117214Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:40.3118055Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:40.3119336Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.3120767Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.3121948Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:40.3123125Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:40.3124374Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.3125973Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.3127095Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3128056Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3128839Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:40.3129914Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:40.6612660Z self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295c8f240>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
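This is the eager path: with compiled=False, silu_mul_quant launches the fused _fbgemm_silu_mul_quant kernel directly, so the failure no longer involves torch.compile at all. The math the kernel fuses is the test's own ref_fn — SiLU(x0) * x1 in fp32, then row-wise FP8 quantization. A small unfused sketch, reusing the quantization sketch above (names hypothetical):

    # Unfused sketch of the fused kernel's math (assumed equivalent):
    # SiLU(x0) * x1 in fp32, then row-wise FP8 quantization.
    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_sketch(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
        return quantize_fp8_row_sketch(y, scale_ub)  # from the earlier sketch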
2025-05-07T20:32:40.6653787Z x1 = x[:, D:] 2025-05-07T20:32:40.6653988Z 2025-05-07T20:32:40.6654177Z if contiguous: 2025-05-07T20:32:40.6654458Z x0 = x0.contiguous() 2025-05-07T20:32:40.6654717Z x1 = x1.contiguous() 2025-05-07T20:32:40.6654966Z 2025-05-07T20:32:40.6655165Z if scale_ub is not None: 2025-05-07T20:32:40.6655443Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6655787Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6656110Z ) 2025-05-07T20:32:40.6656318Z else: 2025-05-07T20:32:40.6656542Z scale_ub_tensor = None 2025-05-07T20:32:40.6656813Z 2025-05-07T20:32:40.6657053Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6657366Z op = silu_mul_quant 2025-05-07T20:32:40.6657618Z if compiled: 2025-05-07T20:32:40.6657871Z op = torch.compile(op) 2025-05-07T20:32:40.6658168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6658456Z 2025-05-07T20:32:40.6658652Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.6658940Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.6659255Z 2025-05-07T20:32:40.6659506Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6659855Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.6660164Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.6660501Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.6660927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6661248Z 2025-05-07T20:32:40.6661459Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:40.6661657Z 2025-05-07T20:32:40.6661767Z moe/activation_test.py:126: 2025-05-07T20:32:40.6662069Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6662419Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.6662762Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6663580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.6664385Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.6665052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6665783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6666509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.6667359Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.6668137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.6668819Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.6669457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.6670021Z fn() 2025-05-07T20:32:40.6670570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.6671218Z self.fn.run( 2025-05-07T20:32:40.6671742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6672318Z kernel = self.compile( 2025-05-07T20:32:40.6672898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6673587Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6674009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6674247Z 2025-05-07T20:32:40.6674471Z self = 2025-05-07T20:32:40.6675616Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6677043Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129679a020>} 2025-05-07T20:32:40.6678461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6679560Z context = 2025-05-07T20:32:40.6679860Z 2025-05-07T20:32:40.6680042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6680582Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6681126Z module_map=module_map) 2025-05-07T20:32:40.6681510Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6681883Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.6682163Z E ^ 2025-05-07T20:32:40.6682647Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6683117Z 2025-05-07T20:32:40.6683556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6684101Z 2025-05-07T20:32:40.6684213Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6684635Z self=, 2025-05-07T20:32:40.6685052Z T=16384, 2025-05-07T20:32:40.6685252Z D=7168, 2025-05-07T20:32:40.6685441Z scale_ub=1200.0, 2025-05-07T20:32:40.6685663Z contiguous=False, 2025-05-07T20:32:40.6685886Z compiled=False, 2025-05-07T20:32:40.6686086Z ) 2025-05-07T20:32:40.8598907Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.8600223Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:40.8601684Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.8603312Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.8604333Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.8605713Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.8607177Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8608216Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.8609514Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.8610966Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8612097Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.8613465Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.8614910Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:40.8616209Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.8617499Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:40.8618378Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.8619458Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:40.8620541Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:40.8621421Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:40.8622702Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.8624207Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.8625609Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:40.8626892Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:40.8628133Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.8629562Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.8630780Z 
W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8632002Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8632886Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:40.8641320Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[identify_mutated_tensors warning with byte-identical traceback repeated three more times at 20:32:40.916, 20:32:41.108, and 20:32:41.117; condensed here.]
2025-05-07T20:32:41.8738810Z self = 2025-05-07T20:32:41.8739606Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
[test body identical to the T=2048, compiled=False example above; fn() fails at moe/activation_test.py:117 with the same CompilationError in _fbgemm_silu_mul_quant]
2025-05-07T20:32:41.8769736Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True )
[test body identical to the T=2048, compiled=True example above; ref_fn() fails at moe/activation_test.py:126 with the same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:32:41.8810332Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False )
[identify_mutated_tensors warning with byte-identical traceback repeated at 20:32:42.174, 20:32:42.379, and 20:32:42.672; condensed here. The 20:32:42.682 occurrence is kept below.]
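[editor's note] The W0507 identify_mutated_tensors warnings come from torch.compile's handling of user-defined Triton kernels: torch._higher_order_ops.triton_kernel_wrap first compiles each kernel to TTIR to work out which arguments it mutates, and when that compilation raises, as it does here, it logs the traceback and conservatively assumes every input is mutated; the same CompilationError then surfaces again when the kernel is actually launched. The underlying Triton rejection is reproducible without FBGEMM; a hedged sketch under the assumptions above (kernel body and names are illustrative):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _store_fp8(x_ptr, y_ptr):
        # The fp8e4nv dtype is rejected while this kernel is lowered to
        # TTIR on SM < 8.9, raising the ValueError seen throughout this log.
        tl.store(y_ptr, tl.load(x_ptr).to(tl.float8e4nv))

    x = torch.randn(16, device="cuda")
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _store_fp8[(1,)](x, y)  # CompilationError on an A10G (SM 8.6)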
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:42.3847536Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:42.3848406Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.3849480Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:42.3850563Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:42.3851397Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:42.3852675Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:42.3854027Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:42.3855364Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:42.3856460Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:42.3857788Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:42.3859326Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:42.3860433Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.3861439Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.3862212Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:42.3863286Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6756921Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:42.6759200Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:42.6761567Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:42.6763088Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:42.6764129Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6765498Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:42.6766963Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6767997Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6769299Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.6770754Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6771947Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6773298Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:42.6774896Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:42.6776520Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:42.6777803Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:42.6778833Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6779911Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:42.6780981Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:42.6781812Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:42.6783145Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:42.6784700Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:42.6785882Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:42.6795378Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:42.6796665Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:42.6798116Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:42.6799248Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6800200Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6800985Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:42.6802112Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.0506927Z self = 2025-05-07T20:32:44.0507548Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.0507851Z 2025-05-07T20:32:44.0507945Z @given( 2025-05-07T20:32:44.0508189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.0508522Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.0508856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.0509205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.0509545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.0509852Z ) 2025-05-07T20:32:44.0510217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.0510675Z def test_silu_mul_quant( 2025-05-07T20:32:44.0510926Z self, 2025-05-07T20:32:44.0511131Z T: int, 2025-05-07T20:32:44.0511328Z D: int, 2025-05-07T20:32:44.0511553Z scale_ub: Optional[float], 2025-05-07T20:32:44.0511835Z contiguous: bool, 2025-05-07T20:32:44.0512086Z compiled: bool, 2025-05-07T20:32:44.0512325Z ) -> None: 2025-05-07T20:32:44.0512549Z torch.manual_seed(2025) 2025-05-07T20:32:44.0512797Z 2025-05-07T20:32:44.0513093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.0513467Z 2025-05-07T20:32:44.0513659Z x_sign = torch.sign(x) 2025-05-07T20:32:44.0513963Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.0514283Z x = x_sign * x_clamp 2025-05-07T20:32:44.0514533Z x0 = x[:, :D] 2025-05-07T20:32:44.0514753Z x1 = x[:, D:] 2025-05-07T20:32:44.0514972Z 2025-05-07T20:32:44.0515170Z if contiguous: 2025-05-07T20:32:44.0515403Z x0 = x0.contiguous() 2025-05-07T20:32:44.0515670Z x1 = x1.contiguous() 2025-05-07T20:32:44.0515919Z 2025-05-07T20:32:44.0516108Z if scale_ub is not None: 2025-05-07T20:32:44.0516390Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.0516743Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.0517062Z ) 2025-05-07T20:32:44.0517264Z else: 2025-05-07T20:32:44.0517485Z scale_ub_tensor = None 2025-05-07T20:32:44.0517740Z 2025-05-07T20:32:44.0517982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.0518323Z op = silu_mul_quant 2025-05-07T20:32:44.0518571Z if compiled: 2025-05-07T20:32:44.0518832Z op = torch.compile(op) 2025-05-07T20:32:44.0519142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0519436Z 2025-05-07T20:32:44.0519627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.0519804Z 2025-05-07T20:32:44.0519909Z moe/activation_test.py:117: 2025-05-07T20:32:44.0520231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0520570Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.0520868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0522007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.0522752Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.0523328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.0524270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.0525003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.0525934Z kernel = self.compile( 2025-05-07T20:32:44.0526531Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.0527252Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.0527693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0527941Z 2025-05-07T20:32:44.0528171Z self = 2025-05-07T20:32:44.0529321Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.0530796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a8180>} 2025-05-07T20:32:44.0532232Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.0533331Z context = 2025-05-07T20:32:44.0533647Z 2025-05-07T20:32:44.0533826Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.0534495Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.0535028Z module_map=module_map) 2025-05-07T20:32:44.0535410Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.0535796Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.0536079Z E ^ 2025-05-07T20:32:44.0536569Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.0537058Z 2025-05-07T20:32:44.0537503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.0538063Z 2025-05-07T20:32:44.0538172Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.0538612Z self=, 2025-05-07T20:32:44.0539041Z T=4096, 2025-05-07T20:32:44.0539254Z D=7168, 2025-05-07T20:32:44.0539452Z scale_ub=None, 2025-05-07T20:32:44.0539667Z contiguous=False, 2025-05-07T20:32:44.0539905Z compiled=False, 2025-05-07T20:32:44.0540118Z ) 2025-05-07T20:32:44.0540442Z self = 2025-05-07T20:32:44.0540968Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.0541265Z 2025-05-07T20:32:44.0541347Z @given( 2025-05-07T20:32:44.0541592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.0541917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.0542240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.0542587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.0542930Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.0543234Z ) 2025-05-07T20:32:44.0543600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.0544268Z def test_silu_mul_quant( 2025-05-07T20:32:44.0544526Z self, 2025-05-07T20:32:44.0544739Z T: int, 2025-05-07T20:32:44.0544971Z D: int, 2025-05-07T20:32:44.0545199Z scale_ub: Optional[float], 2025-05-07T20:32:44.0545626Z contiguous: bool, 2025-05-07T20:32:44.0545885Z compiled: bool, 2025-05-07T20:32:44.0546112Z ) -> None: 2025-05-07T20:32:44.0546336Z torch.manual_seed(2025) 2025-05-07T20:32:44.0546584Z 2025-05-07T20:32:44.0546872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.0547227Z 2025-05-07T20:32:44.0547427Z x_sign = torch.sign(x) 2025-05-07T20:32:44.0547729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.0548046Z x = x_sign * x_clamp 2025-05-07T20:32:44.0548299Z x0 = x[:, :D] 
2025-05-07T20:32:44.0548525Z x1 = x[:, D:] 2025-05-07T20:32:44.0548733Z 2025-05-07T20:32:44.0548925Z if contiguous: 2025-05-07T20:32:44.0549170Z x0 = x0.contiguous() 2025-05-07T20:32:44.0549434Z x1 = x1.contiguous() 2025-05-07T20:32:44.0549686Z 2025-05-07T20:32:44.0549890Z if scale_ub is not None: 2025-05-07T20:32:44.0550165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.0550526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.0550853Z ) 2025-05-07T20:32:44.0551052Z else: 2025-05-07T20:32:44.0551277Z scale_ub_tensor = None 2025-05-07T20:32:44.0551577Z 2025-05-07T20:32:44.0551844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.0552166Z op = silu_mul_quant 2025-05-07T20:32:44.0552427Z if compiled: 2025-05-07T20:32:44.0552687Z op = torch.compile(op) 2025-05-07T20:32:44.0552989Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0553283Z 2025-05-07T20:32:44.0553481Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.0553648Z 2025-05-07T20:32:44.0553752Z moe/activation_test.py:117: 2025-05-07T20:32:44.0554067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0554414Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.0554704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0555442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.0556307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.0556884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.0557697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.0558487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.0559132Z kernel = self.compile( 2025-05-07T20:32:44.0559719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.0560414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.0560832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0561081Z 2025-05-07T20:32:44.0561302Z self = 2025-05-07T20:32:44.0562426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.0563875Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a9080>} 2025-05-07T20:32:44.0566261Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.0567364Z context = 2025-05-07T20:32:44.0567752Z 2025-05-07T20:32:44.0567921Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.0568460Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.0568946Z module_map=module_map) 2025-05-07T20:32:44.0569315Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.0569676Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.0569940Z E ^ 2025-05-07T20:32:44.0570413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.0570891Z 2025-05-07T20:32:44.0571338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.0571892Z 2025-05-07T20:32:44.0571996Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.0572422Z self=, 2025-05-07T20:32:44.0572837Z T=128, 2025-05-07T20:32:44.0573031Z D=7168, 2025-05-07T20:32:44.0573224Z scale_ub=None, 2025-05-07T20:32:44.0573437Z contiguous=False, 2025-05-07T20:32:44.0573663Z compiled=True, 2025-05-07T20:32:44.0573872Z ) 2025-05-07T20:32:44.1149753Z self = 2025-05-07T20:32:44.1150302Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.1150587Z 2025-05-07T20:32:44.1150670Z @given( 2025-05-07T20:32:44.1150910Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1151237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1151598Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1151974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1152315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1152617Z ) 2025-05-07T20:32:44.1152975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1153450Z def test_silu_mul_quant( 2025-05-07T20:32:44.1153701Z self, 2025-05-07T20:32:44.1153898Z T: int, 2025-05-07T20:32:44.1154099Z D: int, 2025-05-07T20:32:44.1154326Z scale_ub: Optional[float], 2025-05-07T20:32:44.1154603Z contiguous: bool, 2025-05-07T20:32:44.1154852Z compiled: bool, 2025-05-07T20:32:44.1155087Z ) -> None: 2025-05-07T20:32:44.1155300Z torch.manual_seed(2025) 2025-05-07T20:32:44.1155552Z 2025-05-07T20:32:44.1155842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1156195Z 2025-05-07T20:32:44.1156394Z x_sign = torch.sign(x) 2025-05-07T20:32:44.1156700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.1157017Z x = x_sign * x_clamp 2025-05-07T20:32:44.1157264Z x0 = x[:, :D] 2025-05-07T20:32:44.1157489Z x1 = x[:, D:] 2025-05-07T20:32:44.1157706Z 2025-05-07T20:32:44.1157890Z if contiguous: 2025-05-07T20:32:44.1158134Z x0 = x0.contiguous() 2025-05-07T20:32:44.1158405Z x1 = x1.contiguous() 2025-05-07T20:32:44.1158652Z 2025-05-07T20:32:44.1158853Z if scale_ub is not None: 2025-05-07T20:32:44.1159137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.1159475Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.1159797Z ) 2025-05-07T20:32:44.1159997Z else: 2025-05-07T20:32:44.1160210Z scale_ub_tensor = None 2025-05-07T20:32:44.1160473Z 2025-05-07T20:32:44.1160710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1161265Z op = silu_mul_quant 2025-05-07T20:32:44.1161528Z if compiled: 2025-05-07T20:32:44.1161806Z op = torch.compile(op) 2025-05-07T20:32:44.1162127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1162411Z 2025-05-07T20:32:44.1162765Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.1163057Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.1163355Z 2025-05-07T20:32:44.1163598Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1163945Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.1164239Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.1164564Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.1164935Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.1165251Z 2025-05-07T20:32:44.1165458Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.1165683Z 2025-05-07T20:32:44.1165796Z moe/activation_test.py:126: 2025-05-07T20:32:44.1166105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1166455Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.1166797Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.1167655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.1168587Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.1169158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.1169964Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.1170741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.1171609Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.1172393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.1173073Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.1173721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.1174268Z fn() 2025-05-07T20:32:44.1174936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.1175556Z self.fn.run( 2025-05-07T20:32:44.1176038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.1176600Z kernel = self.compile( 2025-05-07T20:32:44.1177168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.1177862Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.1178266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1178510Z 2025-05-07T20:32:44.1178723Z self = 2025-05-07T20:32:44.1179854Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.1181294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a9f80>} 2025-05-07T20:32:44.1182703Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.1183890Z context = 2025-05-07T20:32:44.1184204Z 2025-05-07T20:32:44.1184379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.1184933Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.1185528Z module_map=module_map) 2025-05-07T20:32:44.1185910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.1186289Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.1186574Z E ^ 2025-05-07T20:32:44.1187057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1187539Z 2025-05-07T20:32:44.1187979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.1188524Z 2025-05-07T20:32:44.1188646Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1189080Z self=, 2025-05-07T20:32:44.1189500Z T=128, 2025-05-07T20:32:44.1189700Z D=7168, 2025-05-07T20:32:44.1189900Z scale_ub=None, 2025-05-07T20:32:44.1190122Z contiguous=False, 2025-05-07T20:32:44.1190361Z compiled=False, 2025-05-07T20:32:44.1190576Z ) 2025-05-07T20:32:44.3138811Z self = 2025-05-07T20:32:44.3139381Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3139663Z 2025-05-07T20:32:44.3139749Z @given( 2025-05-07T20:32:44.3139985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3140303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3140606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3140940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3141304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3141591Z ) 2025-05-07T20:32:44.3141944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3142400Z def test_silu_mul_quant( 2025-05-07T20:32:44.3142652Z self, 2025-05-07T20:32:44.3142842Z T: int, 2025-05-07T20:32:44.3143039Z D: int, 2025-05-07T20:32:44.3143259Z scale_ub: Optional[float], 2025-05-07T20:32:44.3143527Z contiguous: bool, 2025-05-07T20:32:44.3143768Z compiled: bool, 2025-05-07T20:32:44.3144001Z ) -> None: 2025-05-07T20:32:44.3144219Z torch.manual_seed(2025) 2025-05-07T20:32:44.3144466Z 2025-05-07T20:32:44.3144744Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3145090Z 2025-05-07T20:32:44.3145286Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3145579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3145885Z x = x_sign * x_clamp 2025-05-07T20:32:44.3146132Z x0 = x[:, :D] 2025-05-07T20:32:44.3146348Z x1 = x[:, D:] 2025-05-07T20:32:44.3146546Z 2025-05-07T20:32:44.3146733Z if contiguous: 2025-05-07T20:32:44.3146971Z x0 = x0.contiguous() 2025-05-07T20:32:44.3147235Z x1 = x1.contiguous() 2025-05-07T20:32:44.3147473Z 2025-05-07T20:32:44.3147665Z if scale_ub is not None: 2025-05-07T20:32:44.3147940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3148271Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3148584Z ) 2025-05-07T20:32:44.3148773Z else: 2025-05-07T20:32:44.3148977Z scale_ub_tensor = None 2025-05-07T20:32:44.3149231Z 2025-05-07T20:32:44.3149457Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3149767Z op = silu_mul_quant 2025-05-07T20:32:44.3150017Z if compiled: 2025-05-07T20:32:44.3150266Z op = torch.compile(op) 2025-05-07T20:32:44.3150907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3151199Z 2025-05-07T20:32:44.3151398Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3151569Z 2025-05-07T20:32:44.3151693Z moe/activation_test.py:117: 2025-05-07T20:32:44.3152165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3152512Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3152800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3153520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3154251Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3154812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3155528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3156233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3156802Z kernel = self.compile( 2025-05-07T20:32:44.3157366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3158054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3158476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3158720Z 2025-05-07T20:32:44.3158934Z self = 2025-05-07T20:32:44.3160064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3161515Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ab52340>} 2025-05-07T20:32:44.3162935Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3164023Z context = 2025-05-07T20:32:44.3164330Z 2025-05-07T20:32:44.3164501Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3165048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3165540Z module_map=module_map) 2025-05-07T20:32:44.3165910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3166274Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3175723Z E ^ 2025-05-07T20:32:44.3176266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3176754Z 2025-05-07T20:32:44.3177199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3177757Z 2025-05-07T20:32:44.3177876Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3178305Z self=, 2025-05-07T20:32:44.3178735Z T=4096, 2025-05-07T20:32:44.3178939Z D=5120, 2025-05-07T20:32:44.3179145Z scale_ub=1200.0, 2025-05-07T20:32:44.3179371Z contiguous=True, 2025-05-07T20:32:44.3179602Z compiled=False, 2025-05-07T20:32:44.3179823Z ) 2025-05-07T20:32:44.3180150Z self = 2025-05-07T20:32:44.3180668Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3180955Z 2025-05-07T20:32:44.3181045Z @given( 2025-05-07T20:32:44.3181393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3181728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3182101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3182437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3182865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3183168Z ) 2025-05-07T20:32:44.3183531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3183991Z def test_silu_mul_quant( 2025-05-07T20:32:44.3184246Z self, 2025-05-07T20:32:44.3184456Z T: int, 2025-05-07T20:32:44.3184650Z D: int, 2025-05-07T20:32:44.3184874Z scale_ub: Optional[float], 2025-05-07T20:32:44.3185153Z contiguous: bool, 2025-05-07T20:32:44.3185389Z compiled: bool, 2025-05-07T20:32:44.3185625Z ) -> None: 2025-05-07T20:32:44.3185844Z torch.manual_seed(2025) 2025-05-07T20:32:44.3186086Z 2025-05-07T20:32:44.3186373Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3186734Z 2025-05-07T20:32:44.3186925Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3187226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3187552Z x = x_sign * x_clamp 2025-05-07T20:32:44.3187791Z x0 = x[:, :D] 2025-05-07T20:32:44.3188008Z x1 = x[:, D:] 2025-05-07T20:32:44.3188222Z 2025-05-07T20:32:44.3188406Z if contiguous: 2025-05-07T20:32:44.3188645Z x0 = x0.contiguous() 2025-05-07T20:32:44.3188910Z x1 = x1.contiguous() 2025-05-07T20:32:44.3189164Z 2025-05-07T20:32:44.3189355Z if scale_ub is not None: 2025-05-07T20:32:44.3189641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3189987Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3190302Z ) 2025-05-07T20:32:44.3190499Z else: 2025-05-07T20:32:44.3190719Z scale_ub_tensor = None 2025-05-07T20:32:44.3190971Z 2025-05-07T20:32:44.3191209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3191533Z op = silu_mul_quant 2025-05-07T20:32:44.3191781Z if compiled: 2025-05-07T20:32:44.3192044Z op = torch.compile(op) 2025-05-07T20:32:44.3192351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3192631Z 2025-05-07T20:32:44.3192828Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3193003Z 2025-05-07T20:32:44.3193102Z moe/activation_test.py:117: 2025-05-07T20:32:44.3193408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3193751Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3194044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3194770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3195501Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3196061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3196780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3197483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3198037Z kernel = self.compile( 2025-05-07T20:32:44.3198607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3199299Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3199704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3199951Z 2025-05-07T20:32:44.3200162Z self = 2025-05-07T20:32:44.3201376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3202867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ab51440>} 2025-05-07T20:32:44.3204361Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3205448Z context = 2025-05-07T20:32:44.3205756Z 2025-05-07T20:32:44.3205928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3206482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3206972Z module_map=module_map) 2025-05-07T20:32:44.3207346Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3207717Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3207996Z E ^ 2025-05-07T20:32:44.3208602Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3209132Z 2025-05-07T20:32:44.3209647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3210208Z 2025-05-07T20:32:44.3210318Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3210815Z self=, 2025-05-07T20:32:44.3211301Z T=1, 2025-05-07T20:32:44.3211503Z D=5120, 2025-05-07T20:32:44.3211707Z scale_ub=None, 2025-05-07T20:32:44.3211933Z contiguous=True, 2025-05-07T20:32:44.3212178Z compiled=True, 2025-05-07T20:32:44.3212391Z ) 2025-05-07T20:32:44.5858919Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:44.5860206Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:44.5861812Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:44.5863377Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:44.5864428Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.5865826Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:44.5867304Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.5868353Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.5870036Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.5871513Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.5872643Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.5874140Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:44.5875472Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:44.5876780Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:44.5878068Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:44.5878951Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.5880029Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:44.5881108Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:44.5882003Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:44.5883300Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:44.5884664Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:44.5885844Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:44.5886962Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:44.5888224Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:44.5889662Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:44.5890771Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.5891738Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.5892517Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:44.5893591Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:45.0941881Z self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ab53060>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
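Every example hypothesis tries fails at this same spot: Triton refuses to lower the kernel because its output dtype is fp8e4nv (PyTorch's float8_e4m3fn), which NVIDIA GPUs implement only from compute capability 8.9 (Ada/Hopper) onward; per the error text, this device exposes only fp8e4b15 and fp8e5. Below is a minimal sketch of a capability guard that would skip the test on such GPUs; the helper name and decorator placement are illustrative assumptions, not FBGEMM's actual test scaffolding.

import unittest

import torch


def device_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs sm_89 or newer; older CUDA GPUs
    # only expose Triton's fp8e4b15/fp8e5 encodings, hence the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not device_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class TestSiluMulQuant(unittest.TestCase):
    ...  # test_silu_mul_quant as shown in the failure above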

2025-05-07T20:32:45.0980819Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:45.3370585Z W0507 20:32:45.334000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:45.4070457Z W0507 20:32:45.404000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:45.6116792Z W0507 20:32:45.608000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:45.6217406Z W0507 20:32:45.618000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:45.8306683Z self = 
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [... test_silu_mul_quant body identical to the T = 1 failure above ...]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    [... Triton runtime, autotuner, and compiler frames identical to the first failure ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
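The failing frame is in the reference path: ref_fn calls triton_quantize_fp8_row directly, so the error reproduces without hypothesis or torch.compile in the loop. A hypothetical standalone repro, assuming fbgemm_gpu's experimental GEMM package is importable from the path shown in the stack:

import torch

from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

# Row-wise FP8 quantization of an arbitrary float tensor, as in ref_fn above.
y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
# On a pre-sm_89 GPU this raises the same CompilationError seen in this log;
# on sm_89+ it returns the FP8 rows and their per-row scales.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)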

2025-05-07T20:32:45.8346006Z Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:46.0803373Z W0507 20:32:46.077000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:46.1510009Z W0507 20:32:46.148000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:46.3579313Z W0507 20:32:46.354000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:46.3680013Z W0507 20:32:46.365000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.6149554Z self = 2025-05-07T20:32:46.6150180Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.6150574Z 2025-05-07T20:32:46.6150847Z @given( 2025-05-07T20:32:46.6151090Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.6151424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.6151752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.6152098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.6152471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.6152767Z ) 2025-05-07T20:32:46.6153136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.6153612Z def test_silu_mul_quant( 2025-05-07T20:32:46.6153862Z self, 2025-05-07T20:32:46.6154068Z T: int, 2025-05-07T20:32:46.6154277Z D: int, 2025-05-07T20:32:46.6154508Z scale_ub: Optional[float], 2025-05-07T20:32:46.6154798Z contiguous: bool, 2025-05-07T20:32:46.6155055Z compiled: bool, 2025-05-07T20:32:46.6155285Z ) -> None: 2025-05-07T20:32:46.6155504Z torch.manual_seed(2025) 2025-05-07T20:32:46.6155760Z 2025-05-07T20:32:46.6156034Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.6156394Z 2025-05-07T20:32:46.6156590Z x_sign = torch.sign(x) 2025-05-07T20:32:46.6156882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.6157205Z x = x_sign * x_clamp 2025-05-07T20:32:46.6157454Z x0 = x[:, :D] 2025-05-07T20:32:46.6157672Z x1 = x[:, D:] 2025-05-07T20:32:46.6157880Z 2025-05-07T20:32:46.6158067Z if contiguous: 2025-05-07T20:32:46.6158302Z x0 = x0.contiguous() 2025-05-07T20:32:46.6158563Z x1 = x1.contiguous() 2025-05-07T20:32:46.6158812Z 2025-05-07T20:32:46.6159008Z if scale_ub is not None: 2025-05-07T20:32:46.6159279Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.6159619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.6159942Z ) 2025-05-07T20:32:46.6160144Z else: 2025-05-07T20:32:46.6160361Z scale_ub_tensor = None 2025-05-07T20:32:46.6160629Z 2025-05-07T20:32:46.6160864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.6161183Z op = silu_mul_quant 2025-05-07T20:32:46.6161437Z if compiled: 2025-05-07T20:32:46.6161680Z op = torch.compile(op) 2025-05-07T20:32:46.6161988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6162272Z 2025-05-07T20:32:46.6162465Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.6162754Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.6163053Z 2025-05-07T20:32:46.6163292Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.6163644Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.6163949Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.6164269Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.6164634Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.6164968Z 2025-05-07T20:32:46.6165174Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.6165375Z 2025-05-07T20:32:46.6165489Z moe/activation_test.py:126: 2025-05-07T20:32:46.6165786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6166134Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.6166477Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.6167299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:46.6168084Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.6168780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.6169506Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.6170311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.6171068Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.6171834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.6180748Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.6181437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.6181986Z fn() 2025-05-07T20:32:46.6182590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.6183215Z self.fn.run( 2025-05-07T20:32:46.6183725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.6184301Z kernel = self.compile( 2025-05-07T20:32:46.6184875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.6185573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.6185996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6186250Z 2025-05-07T20:32:46.6186465Z self = 2025-05-07T20:32:46.6187602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.6189052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a6a5f80>} 2025-05-07T20:32:46.6190482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.6191569Z context = 2025-05-07T20:32:46.6191879Z 2025-05-07T20:32:46.6192051Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.6192599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.6193096Z module_map=module_map) 2025-05-07T20:32:46.6193471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.6193853Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.6194140Z E ^ 2025-05-07T20:32:46.6194623Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.6195108Z 2025-05-07T20:32:46.6195555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.6196117Z 2025-05-07T20:32:46.6196224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.6196659Z self=, 2025-05-07T20:32:46.6197084Z T=4096, 2025-05-07T20:32:46.6197288Z D=5120, 2025-05-07T20:32:46.6197502Z scale_ub=None, 2025-05-07T20:32:46.6197720Z contiguous=True, 2025-05-07T20:32:46.6197963Z compiled=True, 2025-05-07T20:32:46.6198182Z ) 2025-05-07T20:32:46.8692127Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:46.8693292Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:46.8694776Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:46.8696400Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:46.8697424Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.8698805Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:46.8700253Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.8701289Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.8702578Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:46.8704033Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.8705145Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.8706489Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:46.8707803Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:46.8709085Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:46.8710359Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:46.8711231Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.8712310Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:46.8713375Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:46.8714203Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:46.8715558Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:46.8716919Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:46.8718093Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:46.8719269Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:46.8720511Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:46.8721951Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:46.8723109Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.8724069Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.8724851Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:46.8726079Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4068237Z self = 2025-05-07T20:32:47.4068977Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.4069394Z 2025-05-07T20:32:47.4069505Z @given( 2025-05-07T20:32:47.4069835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4070250Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4070581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4070932Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4071285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4071587Z ) 2025-05-07T20:32:47.4072120Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4072587Z def test_silu_mul_quant( 2025-05-07T20:32:47.4072826Z self, 2025-05-07T20:32:47.4073028Z T: int, 2025-05-07T20:32:47.4073232Z D: int, 2025-05-07T20:32:47.4073558Z scale_ub: Optional[float], 2025-05-07T20:32:47.4073845Z contiguous: bool, 2025-05-07T20:32:47.4074093Z compiled: bool, 2025-05-07T20:32:47.4074313Z ) -> None: 2025-05-07T20:32:47.4074531Z torch.manual_seed(2025) 2025-05-07T20:32:47.4074788Z 2025-05-07T20:32:47.4075061Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4075421Z 2025-05-07T20:32:47.4075619Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4075911Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4076228Z x = x_sign * x_clamp 2025-05-07T20:32:47.4076472Z x0 = x[:, :D] 2025-05-07T20:32:47.4076679Z x1 = x[:, D:] 2025-05-07T20:32:47.4076899Z 2025-05-07T20:32:47.4077085Z if contiguous: 2025-05-07T20:32:47.4077314Z x0 = x0.contiguous() 2025-05-07T20:32:47.4077574Z x1 = x1.contiguous() 2025-05-07T20:32:47.4077819Z 2025-05-07T20:32:47.4078009Z if scale_ub is not None: 2025-05-07T20:32:47.4078304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4078646Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4078961Z ) 2025-05-07T20:32:47.4079158Z else: 2025-05-07T20:32:47.4079365Z scale_ub_tensor = None 2025-05-07T20:32:47.4079626Z 2025-05-07T20:32:47.4079862Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4080174Z op = silu_mul_quant 2025-05-07T20:32:47.4080430Z if compiled: 2025-05-07T20:32:47.4080682Z op = torch.compile(op) 2025-05-07T20:32:47.4080979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4081265Z 2025-05-07T20:32:47.4081470Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.4081753Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.4082056Z 2025-05-07T20:32:47.4082297Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4082637Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.4082942Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.4083269Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.4083639Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.4083951Z 2025-05-07T20:32:47.4084152Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.4084353Z 2025-05-07T20:32:47.4084460Z moe/activation_test.py:126: 2025-05-07T20:32:47.4084760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4085099Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.4085436Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.4086253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:47.4087042Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.4087626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4088341Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4089061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.4089821Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.4090589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.4091381Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.4092009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.4092589Z fn() 2025-05-07T20:32:47.4093140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.4093831Z self.fn.run( 2025-05-07T20:32:47.4094320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4095043Z kernel = self.compile( 2025-05-07T20:32:47.4095611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4096297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4096713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4096952Z 2025-05-07T20:32:47.4097183Z self = 2025-05-07T20:32:47.4098313Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4099751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a002520>} 2025-05-07T20:32:47.4101169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4102265Z context = 2025-05-07T20:32:47.4102565Z 2025-05-07T20:32:47.4102745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4103291Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4103778Z module_map=module_map) 2025-05-07T20:32:47.4104162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4104541Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.4104814Z E ^ 2025-05-07T20:32:47.4105296Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4105771Z 2025-05-07T20:32:47.4106216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4106758Z 2025-05-07T20:32:47.4106871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4107294Z self=, 2025-05-07T20:32:47.4107718Z T=16384, 2025-05-07T20:32:47.4107921Z D=5120, 2025-05-07T20:32:47.4108111Z scale_ub=None, 2025-05-07T20:32:47.4108332Z contiguous=True, 2025-05-07T20:32:47.4108558Z compiled=True, 2025-05-07T20:32:47.4108755Z ) 2025-05-07T20:32:47.4393957Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:47.4395266Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:47.4396672Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:47.4397709Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:47.4399021Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:47.5245759Z self = 2025-05-07T20:32:47.5247019Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.5247962Z 2025-05-07T20:32:47.5248141Z @given( 2025-05-07T20:32:47.5248603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5249227Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5249845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5250520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5251179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5251754Z ) 2025-05-07T20:32:47.5252455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5253249Z def test_silu_mul_quant( 2025-05-07T20:32:47.5253506Z self, 2025-05-07T20:32:47.5253706Z T: int, 2025-05-07T20:32:47.5253906Z D: int, 2025-05-07T20:32:47.5254127Z scale_ub: Optional[float], 2025-05-07T20:32:47.5254533Z contiguous: bool, 2025-05-07T20:32:47.5254773Z compiled: bool, 2025-05-07T20:32:47.5254996Z ) -> None: 2025-05-07T20:32:47.5255207Z torch.manual_seed(2025) 2025-05-07T20:32:47.5255447Z 2025-05-07T20:32:47.5255715Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5256067Z 2025-05-07T20:32:47.5256261Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5256541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.5256853Z x = x_sign * x_clamp 2025-05-07T20:32:47.5257089Z x0 = x[:, :D] 2025-05-07T20:32:47.5257297Z x1 = x[:, D:] 2025-05-07T20:32:47.5257505Z 2025-05-07T20:32:47.5257688Z if contiguous: 2025-05-07T20:32:47.5257913Z x0 = x0.contiguous() 2025-05-07T20:32:47.5258179Z x1 = x1.contiguous() 2025-05-07T20:32:47.5258425Z 2025-05-07T20:32:47.5258609Z if scale_ub is not None: 2025-05-07T20:32:47.5258884Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.5259223Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:47.5259529Z ) 2025-05-07T20:32:47.5259723Z else: 2025-05-07T20:32:47.5259932Z scale_ub_tensor = None 2025-05-07T20:32:47.5260184Z 2025-05-07T20:32:47.5260419Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5260734Z op = silu_mul_quant 2025-05-07T20:32:47.5260988Z if compiled: 2025-05-07T20:32:47.5261228Z op = torch.compile(op) 2025-05-07T20:32:47.5261526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5261803Z 2025-05-07T20:32:47.5261989Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.5262272Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.5262572Z 2025-05-07T20:32:47.5262808Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5263148Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.5263443Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.5263757Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.5264125Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.5264438Z 2025-05-07T20:32:47.5264635Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.5264831Z 2025-05-07T20:32:47.5264933Z moe/activation_test.py:126: 2025-05-07T20:32:47.5265231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5265569Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.5265894Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.5266840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.5267640Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.5268209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.5269001Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.5269720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.5270478Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.5271241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.5271912Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.5272543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.5273150Z fn() 2025-05-07T20:32:47.5273676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.5274294Z self.fn.run( 2025-05-07T20:32:47.5274786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.5275344Z kernel = self.compile( 2025-05-07T20:32:47.5275910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.5276598Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.5277006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5277241Z 2025-05-07T20:32:47.5277451Z self = 2025-05-07T20:32:47.5278585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.5280011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279963420>} 2025-05-07T20:32:47.5281426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.5282514Z context = 2025-05-07T20:32:47.5282849Z 2025-05-07T20:32:47.5283030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.5283571Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.5284064Z module_map=module_map) 2025-05-07T20:32:47.5284433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.5284804Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.5285088Z E ^ 2025-05-07T20:32:47.5285573Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.5286050Z 2025-05-07T20:32:47.5286491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.5287038Z 2025-05-07T20:32:47.5287145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5287589Z self=, 2025-05-07T20:32:47.5288018Z T=1, 2025-05-07T20:32:47.5288208Z D=5120, 2025-05-07T20:32:47.5288417Z scale_ub=1200.0, 2025-05-07T20:32:47.5288653Z contiguous=True, 2025-05-07T20:32:47.5288881Z compiled=True, 2025-05-07T20:32:47.5289092Z ) 2025-05-07T20:32:47.6646964Z self = 2025-05-07T20:32:47.6647959Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.6648460Z 2025-05-07T20:32:47.6648614Z @given( 2025-05-07T20:32:47.6649234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.6649814Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.6650377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.6659161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.6659509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.6659798Z ) 2025-05-07T20:32:47.6660162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.6660629Z def test_silu_mul_quant( 2025-05-07T20:32:47.6660880Z self, 2025-05-07T20:32:47.6661083Z T: int, 2025-05-07T20:32:47.6661285Z D: int, 2025-05-07T20:32:47.6661512Z scale_ub: Optional[float], 2025-05-07T20:32:47.6661801Z contiguous: bool, 2025-05-07T20:32:47.6662052Z compiled: bool, 2025-05-07T20:32:47.6662285Z ) -> None: 2025-05-07T20:32:47.6662502Z torch.manual_seed(2025) 2025-05-07T20:32:47.6662754Z 2025-05-07T20:32:47.6663047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.6663401Z 2025-05-07T20:32:47.6663608Z x_sign = torch.sign(x) 2025-05-07T20:32:47.6663909Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.6664224Z x = x_sign * x_clamp 2025-05-07T20:32:47.6664470Z x0 = x[:, :D] 2025-05-07T20:32:47.6664692Z x1 = x[:, D:] 2025-05-07T20:32:47.6664901Z 2025-05-07T20:32:47.6665096Z if contiguous: 2025-05-07T20:32:47.6665332Z x0 = x0.contiguous() 2025-05-07T20:32:47.6665585Z x1 = x1.contiguous() 2025-05-07T20:32:47.6665832Z 2025-05-07T20:32:47.6666023Z if scale_ub is not None: 2025-05-07T20:32:47.6666296Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.6666639Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.6666957Z ) 2025-05-07T20:32:47.6667147Z else: 2025-05-07T20:32:47.6667377Z scale_ub_tensor = 
None 2025-05-07T20:32:47.6667646Z 2025-05-07T20:32:47.6667881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.6668198Z op = silu_mul_quant 2025-05-07T20:32:47.6668450Z if compiled: 2025-05-07T20:32:47.6668705Z op = torch.compile(op) 2025-05-07T20:32:47.6669007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.6669291Z 2025-05-07T20:32:47.6669482Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.6669649Z 2025-05-07T20:32:47.6669748Z moe/activation_test.py:117: 2025-05-07T20:32:47.6670054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.6670402Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.6670683Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.6671271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.6671859Z return fn(*args, **kwargs) 2025-05-07T20:32:47.6672554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.6673275Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.6673837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.6674554Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.6675255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.6675810Z kernel = self.compile( 2025-05-07T20:32:47.6676487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.6677190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.6677602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.6677924Z 2025-05-07T20:32:47.6678135Z self = 2025-05-07T20:32:47.6679268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.6680703Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279470180>} 2025-05-07T20:32:47.6682120Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.6683241Z context = 2025-05-07T20:32:47.6683566Z 2025-05-07T20:32:47.6683742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.6684285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.6684773Z module_map=module_map) 2025-05-07T20:32:47.6685142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.6685507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.6685776Z E ^ 2025-05-07T20:32:47.6686253Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.6686742Z 2025-05-07T20:32:47.6687188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.6687738Z 2025-05-07T20:32:47.6687840Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.6688268Z self=, 2025-05-07T20:32:47.6688690Z T=1, 2025-05-07T20:32:47.6688880Z D=5120, 2025-05-07T20:32:47.6689081Z scale_ub=None, 2025-05-07T20:32:47.6689295Z contiguous=False, 2025-05-07T20:32:47.6689528Z compiled=True, 2025-05-07T20:32:47.6689733Z ) 2025-05-07T20:32:47.7282476Z self = 2025-05-07T20:32:47.7283314Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.7283618Z 2025-05-07T20:32:47.7283699Z @given( 2025-05-07T20:32:47.7283934Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7284252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7284557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7284899Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7285235Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7285526Z ) 2025-05-07T20:32:47.7285883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7286349Z def test_silu_mul_quant( 2025-05-07T20:32:47.7286588Z self, 2025-05-07T20:32:47.7286780Z T: int, 2025-05-07T20:32:47.7286975Z D: int, 2025-05-07T20:32:47.7287186Z scale_ub: Optional[float], 2025-05-07T20:32:47.7287464Z contiguous: bool, 2025-05-07T20:32:47.7287704Z compiled: bool, 2025-05-07T20:32:47.7287922Z ) -> None: 2025-05-07T20:32:47.7288146Z torch.manual_seed(2025) 2025-05-07T20:32:47.7288389Z 2025-05-07T20:32:47.7288660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7289024Z 2025-05-07T20:32:47.7289216Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7289672Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7289989Z x = x_sign * x_clamp 2025-05-07T20:32:47.7290226Z x0 = x[:, :D] 2025-05-07T20:32:47.7290439Z x1 = x[:, D:] 2025-05-07T20:32:47.7290646Z 2025-05-07T20:32:47.7290943Z if contiguous: 2025-05-07T20:32:47.7291180Z x0 = x0.contiguous() 2025-05-07T20:32:47.7291435Z x1 = x1.contiguous() 2025-05-07T20:32:47.7291678Z 2025-05-07T20:32:47.7291872Z if scale_ub is not None: 2025-05-07T20:32:47.7292144Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.7292485Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.7292804Z ) 2025-05-07T20:32:47.7292992Z else: 2025-05-07T20:32:47.7293205Z scale_ub_tensor = None 2025-05-07T20:32:47.7293462Z 2025-05-07T20:32:47.7293689Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7294012Z op = silu_mul_quant 2025-05-07T20:32:47.7294271Z if compiled: 2025-05-07T20:32:47.7294626Z op = torch.compile(op) 2025-05-07T20:32:47.7294922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7295211Z 2025-05-07T20:32:47.7295401Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.7295688Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.7295987Z 2025-05-07T20:32:47.7296226Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7296560Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.7296864Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.7297183Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.7297541Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.7297858Z 2025-05-07T20:32:47.7298056Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.7298255Z 2025-05-07T20:32:47.7298356Z moe/activation_test.py:126: 2025-05-07T20:32:47.7298654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7299008Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.7299348Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.7300172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.7300960Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.7301531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.7302244Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.7302964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.7303774Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.7304549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.7305221Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.7305850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.7306401Z fn() 2025-05-07T20:32:47.7306935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.7307541Z self.fn.run( 2025-05-07T20:32:47.7308030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.7308582Z kernel = self.compile( 2025-05-07T20:32:47.7309137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.7309911Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.7310331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7310576Z 2025-05-07T20:32:47.7310796Z self = 2025-05-07T20:32:47.7311999Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.7313482Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279473240>} 2025-05-07T20:32:47.7314896Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.7315993Z context = 2025-05-07T20:32:47.7316299Z 2025-05-07T20:32:47.7316475Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.7317011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.7317504Z module_map=module_map) 2025-05-07T20:32:47.7317886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.7318252Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.7318529Z E ^ 2025-05-07T20:32:47.7319013Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.7319490Z 
2025-05-07T20:32:47.7319933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:47.7320478Z 
2025-05-07T20:32:47.7320582Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:47.7321022Z self=,
2025-05-07T20:32:47.7321446Z T=1,
2025-05-07T20:32:47.7321623Z D=5120,
2025-05-07T20:32:47.7321830Z scale_ub=None,
2025-05-07T20:32:47.7322051Z contiguous=True,
2025-05-07T20:32:47.7322283Z compiled=False,
2025-05-07T20:32:47.7322485Z )
2025-05-07T20:32:47.8832901Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:47.8833267Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:47.8833536Z E ^
2025-05-07T20:32:47.8834137Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.8834618Z 
2025-05-07T20:32:47.8835060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:47.8835608Z 
2025-05-07T20:32:47.8835824Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:47.8836258Z self=,
2025-05-07T20:32:47.8836674Z T=128,
2025-05-07T20:32:47.8836867Z D=5120,
2025-05-07T20:32:47.8837065Z scale_ub=None,
2025-05-07T20:32:47.8837282Z contiguous=False,
2025-05-07T20:32:47.8837512Z compiled=True,
2025-05-07T20:32:47.8837721Z )
2025-05-07T20:32:47.8866053Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:47.8866420Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:47.8866688Z E ^
2025-05-07T20:32:47.8867168Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.8867648Z 
2025-05-07T20:32:47.8868089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:47.8868637Z 
2025-05-07T20:32:47.8868744Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:47.8869174Z self=,
2025-05-07T20:32:47.8869588Z T=128,
2025-05-07T20:32:47.8869788Z D=7168,
2025-05-07T20:32:47.8869988Z scale_ub=1200.0,
2025-05-07T20:32:47.8870215Z contiguous=False,
2025-05-07T20:32:47.8870448Z compiled=False,
2025-05-07T20:32:47.8870656Z )
2025-05-07T20:32:48.0011527Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.0011889Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.0012156Z E ^
2025-05-07T20:32:48.0012640Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.0013110Z 
2025-05-07T20:32:48.0013562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
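Note on the failures above and below: every Hypothesis example aborts at the same point, and the sampled parameters (T, D, scale_ub, contiguous, compiled) play no role. Both Triton kernels involved, FBGEMM's _fbgemm_silu_mul_quant (reached through silu_mul_quant in gen_ai/moe/activation.py) and the reference _kernel_quantize_fp8_row (reached through triton_quantize_fp8_row in triton_gemm/fp8_gemm.py), fail at compile time because Triton cannot lower the fp8e4nv dtype on this GPU. fp8e4nv corresponds to torch.float8_e4m3fn and requires NVIDIA compute capability 8.9 or newer (Ada/Hopper); the A10G in a linux.g5.4xlarge runner reports capability 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError states.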
2025-05-07T20:32:48.0014217Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.0014738Z self=,
2025-05-07T20:32:48.0015167Z T=128,
2025-05-07T20:32:48.0015363Z D=5120,
2025-05-07T20:32:48.0015558Z scale_ub=None,
2025-05-07T20:32:48.0015777Z contiguous=False,
2025-05-07T20:32:48.0016007Z compiled=False,
2025-05-07T20:32:48.0016217Z )
2025-05-07T20:32:48.0043341Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.0043719Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.0043982Z E ^
2025-05-07T20:32:48.0044450Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.0044928Z 
2025-05-07T20:32:48.0045366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.0045904Z 
2025-05-07T20:32:48.0046012Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.0046435Z self=,
2025-05-07T20:32:48.0046841Z T=128,
2025-05-07T20:32:48.0047029Z D=5120,
2025-05-07T20:32:48.0047226Z scale_ub=1200.0,
2025-05-07T20:32:48.0047440Z contiguous=True,
2025-05-07T20:32:48.0047656Z compiled=False,
2025-05-07T20:32:48.0047862Z )
2025-05-07T20:32:48.4034726Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.4035089Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.4035354Z E ^
2025-05-07T20:32:48.4035832Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.4036302Z 
2025-05-07T20:32:48.4036742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
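Given that failure mode, a capability check is the usual way to keep this job green on pre-SM89 runners. The following is a minimal sketch only, assuming a unittest-style test class; the helper name supports_fp8e4nv and the skip placement are illustrative, not part of FBGEMM's actual test utilities:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) only lowers on NVIDIA GPUs with
    # compute capability >= 8.9 (Ada/Hopper); the A10G here reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class SiluMulQuantTest(unittest.TestCase):
    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    def test_silu_mul_quant(self) -> None:
        ...  # property-based body as in moe/activation_test.py

With such a guard the runner would report one explicit skip per test instead of recompiling and failing the same kernels for every sampled example, as the log continues to do below.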
2025-05-07T20:32:48.4037386Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.4037811Z self=,
2025-05-07T20:32:48.4038230Z T=1,
2025-05-07T20:32:48.4038417Z D=7168,
2025-05-07T20:32:48.4038612Z scale_ub=1200.0,
2025-05-07T20:32:48.4038836Z contiguous=True,
2025-05-07T20:32:48.4039055Z compiled=True,
2025-05-07T20:32:48.4039261Z )
2025-05-07T20:32:48.4066996Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.4067346Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.4067601Z E ^
2025-05-07T20:32:48.4068156Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.4068627Z 
2025-05-07T20:32:48.4069060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.4069685Z 
2025-05-07T20:32:48.4069789Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.4070212Z self=,
2025-05-07T20:32:48.4070635Z T=1,
2025-05-07T20:32:48.4070818Z D=7168,
2025-05-07T20:32:48.4071018Z scale_ub=1200.0,
2025-05-07T20:32:48.4071250Z contiguous=False,
2025-05-07T20:32:48.4071475Z compiled=True,
2025-05-07T20:32:48.4071683Z )
2025-05-07T20:32:48.5456540Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.5456896Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.5457159Z E ^
2025-05-07T20:32:48.5457642Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.5458114Z 
2025-05-07T20:32:48.5458553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.5459100Z 
2025-05-07T20:32:48.5459202Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.5459635Z self=,
2025-05-07T20:32:48.5460050Z T=1,
2025-05-07T20:32:48.5460236Z D=7168,
2025-05-07T20:32:48.5460430Z scale_ub=None,
2025-05-07T20:32:48.5466911Z contiguous=False,
2025-05-07T20:32:48.5467149Z compiled=True,
2025-05-07T20:32:48.5467372Z )
2025-05-07T20:32:48.6373587Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.6373963Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:48.6374234Z E ^
2025-05-07T20:32:48.6374834Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.6375308Z 
2025-05-07T20:32:48.6375753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.6376298Z 
2025-05-07T20:32:48.6376409Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.6376835Z self=,
2025-05-07T20:32:48.6377263Z T=1,
2025-05-07T20:32:48.6377459Z D=5120,
2025-05-07T20:32:48.6377656Z scale_ub=1200.0,
2025-05-07T20:32:48.6377894Z contiguous=False,
2025-05-07T20:32:48.6378132Z compiled=True,
2025-05-07T20:32:48.6378343Z )
2025-05-07T20:32:48.7909591Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.7909968Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.7910240Z E ^
2025-05-07T20:32:48.7910722Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.7911199Z 
2025-05-07T20:32:48.7911641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.7912183Z 
2025-05-07T20:32:48.7912301Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.7912728Z self=,
2025-05-07T20:32:48.7913138Z T=1,
2025-05-07T20:32:48.7913345Z D=5120,
2025-05-07T20:32:48.7913569Z scale_ub=1200.0,
2025-05-07T20:32:48.7913798Z contiguous=False,
2025-05-07T20:32:48.7914023Z compiled=False,
2025-05-07T20:32:48.7914236Z )
2025-05-07T20:32:48.7941377Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.7941741Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.7941998Z E ^
2025-05-07T20:32:48.7942470Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.7942948Z 2025-05-07T20:32:48.7943391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.7943930Z 2025-05-07T20:32:48.7944037Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.7944453Z self=, 2025-05-07T20:32:48.7944869Z T=16384, 2025-05-07T20:32:48.7945060Z D=5120, 2025-05-07T20:32:48.7945243Z scale_ub=1200.0, 2025-05-07T20:32:48.7945461Z contiguous=False, 2025-05-07T20:32:48.7945681Z compiled=True, 2025-05-07T20:32:48.7945878Z ) 2025-05-07T20:32:48.8812338Z self = 2025-05-07T20:32:48.8813055Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:48.8813831Z 2025-05-07T20:32:48.8814003Z @given( 2025-05-07T20:32:48.8814860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.8815479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.8816088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.8816742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.8817384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.8817952Z ) 2025-05-07T20:32:48.8818646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.8819532Z def test_silu_mul_quant( 2025-05-07T20:32:48.8820001Z self, 2025-05-07T20:32:48.8820374Z T: int, 2025-05-07T20:32:48.8820740Z D: int, 2025-05-07T20:32:48.8821170Z scale_ub: Optional[float], 2025-05-07T20:32:48.8821711Z contiguous: bool, 2025-05-07T20:32:48.8822170Z compiled: bool, 2025-05-07T20:32:48.8822599Z ) -> None: 2025-05-07T20:32:48.8822840Z torch.manual_seed(2025) 2025-05-07T20:32:48.8823124Z 2025-05-07T20:32:48.8823401Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.8823752Z 2025-05-07T20:32:48.8823939Z x_sign = torch.sign(x) 2025-05-07T20:32:48.8824226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.8824546Z x = x_sign * x_clamp 2025-05-07T20:32:48.8824782Z x0 = x[:, :D] 2025-05-07T20:32:48.8824990Z x1 = x[:, D:] 2025-05-07T20:32:48.8825197Z 2025-05-07T20:32:48.8825379Z if contiguous: 2025-05-07T20:32:48.8825771Z x0 = x0.contiguous() 2025-05-07T20:32:48.8826032Z x1 = x1.contiguous() 2025-05-07T20:32:48.8826279Z 2025-05-07T20:32:48.8826477Z if scale_ub is not None: 2025-05-07T20:32:48.8826756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.8827092Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.8827399Z ) 2025-05-07T20:32:48.8827592Z else: 2025-05-07T20:32:48.8827809Z scale_ub_tensor = None 2025-05-07T20:32:48.8828061Z 2025-05-07T20:32:48.8828293Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.8828609Z op = silu_mul_quant 2025-05-07T20:32:48.8828851Z if compiled: 2025-05-07T20:32:48.8829098Z op = torch.compile(op) 2025-05-07T20:32:48.8829390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.8829668Z 2025-05-07T20:32:48.8829848Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.8830016Z 2025-05-07T20:32:48.8830113Z moe/activation_test.py:117: 2025-05-07T20:32:48.8830409Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.8830745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.8831023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.8831609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:48.8832193Z return fn(*args, **kwargs) 
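This failure does not depend on the sampled inputs: fp8e4nv is Triton's name for the NVIDIA FP8 E4M3 format (torch.float8_e4m3fn), and Triton's CUDA backend only lowers it on GPUs with compute capability 8.9 or newer (Ada/Hopper). On older architectures the backend exposes only fp8e4b15 and fp8e5, exactly the pair the ValueError lists, so the error is a property of the GPU the job landed on rather than of any particular (T, D, scale_ub, contiguous, compiled) combination. A minimal sketch of a capability guard a test like this could use to skip instead of fail on such GPUs (the marker name requires_fp8_e4m3 is illustrative, not part of the test file):

    import pytest
    import torch

    # FP8 E4M3 (Triton "fp8e4nv") only compiles on SM 8.9+ GPUs,
    # so skip these cases rather than fail on older architectures.
    requires_fp8_e4m3 = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv (FP8 E4M3) requires compute capability >= 8.9",
    )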
2025-05-07T20:32:48.8832875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.8833599Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.8834161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.8834867Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.8835559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.8836112Z kernel = self.compile( 2025-05-07T20:32:48.8836790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.8837481Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.8837895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.8838249Z 2025-05-07T20:32:48.8838465Z self = 2025-05-07T20:32:48.8839586Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.8841007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f12799ba8e0>} 2025-05-07T20:32:48.8842421Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.8843507Z context = 2025-05-07T20:32:48.8843809Z 2025-05-07T20:32:48.8843982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.8844516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.8845001Z module_map=module_map) 2025-05-07T20:32:48.8845372Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.8845734Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.8845995Z E ^ 2025-05-07T20:32:48.8846473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.8846942Z 2025-05-07T20:32:48.8847396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.8847938Z 2025-05-07T20:32:48.8848041Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.8848462Z self=, 2025-05-07T20:32:48.8848885Z T=2048, 2025-05-07T20:32:48.8849073Z D=7168, 2025-05-07T20:32:48.8849262Z scale_ub=1200.0, 2025-05-07T20:32:48.8849485Z contiguous=False, 2025-05-07T20:32:48.8849708Z compiled=True, 2025-05-07T20:32:48.8849909Z ) 2025-05-07T20:32:48.8850230Z self = 2025-05-07T20:32:48.8850739Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:48.8851024Z 2025-05-07T20:32:48.8851103Z @given( 2025-05-07T20:32:48.8851333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.8851653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.8851966Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.8852298Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.8852633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.8852933Z ) 2025-05-07T20:32:48.8853315Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.8853775Z def test_silu_mul_quant( 2025-05-07T20:32:48.8854018Z self, 2025-05-07T20:32:48.8854208Z T: int, 2025-05-07T20:32:48.8854479Z D: int, 2025-05-07T20:32:48.8854699Z scale_ub: Optional[float], 2025-05-07T20:32:48.8854963Z contiguous: bool, 2025-05-07T20:32:48.8855201Z compiled: bool, 2025-05-07T20:32:48.8855421Z ) -> None: 2025-05-07T20:32:48.8855635Z torch.manual_seed(2025) 2025-05-07T20:32:48.8855877Z 2025-05-07T20:32:48.8856171Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.8856523Z 2025-05-07T20:32:48.8856835Z x_sign = torch.sign(x) 2025-05-07T20:32:48.8864160Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.8864494Z x = x_sign * x_clamp 2025-05-07T20:32:48.8864737Z x0 = x[:, :D] 2025-05-07T20:32:48.8864965Z x1 = x[:, D:] 2025-05-07T20:32:48.8865302Z 2025-05-07T20:32:48.8865510Z if contiguous: 2025-05-07T20:32:48.8865753Z x0 = x0.contiguous() 2025-05-07T20:32:48.8866027Z x1 = x1.contiguous() 2025-05-07T20:32:48.8866286Z 2025-05-07T20:32:48.8866485Z if scale_ub is not None: 2025-05-07T20:32:48.8866773Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.8867123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.8867437Z ) 2025-05-07T20:32:48.8867638Z else: 2025-05-07T20:32:48.8867848Z scale_ub_tensor = None 2025-05-07T20:32:48.8868100Z 2025-05-07T20:32:48.8868335Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.8868663Z op = silu_mul_quant 2025-05-07T20:32:48.8868917Z if compiled: 2025-05-07T20:32:48.8869166Z op = torch.compile(op) 2025-05-07T20:32:48.8869464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.8869745Z 2025-05-07T20:32:48.8869942Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.8870113Z 2025-05-07T20:32:48.8870213Z moe/activation_test.py:117: 2025-05-07T20:32:48.8870521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.8870859Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.8871148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.8871734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:48.8872320Z return fn(*args, **kwargs) 
2025-05-07T20:32:48.8873007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.8873740Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.8874301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.8875007Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.8875714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.8876274Z kernel = self.compile( 2025-05-07T20:32:48.8876832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.8877525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.8877939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.8878181Z 2025-05-07T20:32:48.8878395Z self = 2025-05-07T20:32:48.8879520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.8880955Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279602840>} 2025-05-07T20:32:48.8882367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.8883509Z context = 2025-05-07T20:32:48.8883811Z 2025-05-07T20:32:48.8883985Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.8884597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.8885090Z module_map=module_map) 2025-05-07T20:32:48.8885475Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.8885845Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.8886212Z E ^ 2025-05-07T20:32:48.8886705Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.8887179Z 2025-05-07T20:32:48.8887622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.8888165Z 2025-05-07T20:32:49.0026524Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0026985Z self=, 2025-05-07T20:32:49.0027509Z T=1, 2025-05-07T20:32:49.0027782Z D=5120, 2025-05-07T20:32:49.0028026Z scale_ub=None, 2025-05-07T20:32:49.0028241Z contiguous=False, 2025-05-07T20:32:49.0028478Z compiled=False, 2025-05-07T20:32:49.0028684Z ) 2025-05-07T20:32:49.0029003Z self = 2025-05-07T20:32:49.0029508Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:49.0029794Z 2025-05-07T20:32:49.0029872Z @given( 2025-05-07T20:32:49.0030099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0030413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0030720Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0031056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0031380Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0031669Z ) 2025-05-07T20:32:49.0032019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0032473Z def test_silu_mul_quant( 2025-05-07T20:32:49.0032711Z self, 2025-05-07T20:32:49.0032905Z T: int, 2025-05-07T20:32:49.0033107Z D: int, 2025-05-07T20:32:49.0033317Z scale_ub: Optional[float], 2025-05-07T20:32:49.0033591Z contiguous: bool, 2025-05-07T20:32:49.0033830Z compiled: bool, 2025-05-07T20:32:49.0034045Z ) -> None: 2025-05-07T20:32:49.0034261Z torch.manual_seed(2025) 2025-05-07T20:32:49.0034502Z 2025-05-07T20:32:49.0034772Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0035126Z 2025-05-07T20:32:49.0035321Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0035603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0035922Z x = x_sign * x_clamp 2025-05-07T20:32:49.0036165Z x0 = x[:, :D] 2025-05-07T20:32:49.0036372Z x1 = x[:, D:] 2025-05-07T20:32:49.0036578Z 2025-05-07T20:32:49.0036764Z if contiguous: 2025-05-07T20:32:49.0036989Z x0 = x0.contiguous() 2025-05-07T20:32:49.0037250Z x1 = x1.contiguous() 2025-05-07T20:32:49.0037502Z 2025-05-07T20:32:49.0037686Z if scale_ub is not None: 2025-05-07T20:32:49.0037959Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0038299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0038622Z ) 2025-05-07T20:32:49.0038810Z else: 2025-05-07T20:32:49.0039019Z scale_ub_tensor = None 2025-05-07T20:32:49.0039270Z 2025-05-07T20:32:49.0039492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0039820Z op = silu_mul_quant 2025-05-07T20:32:49.0040073Z if compiled: 2025-05-07T20:32:49.0040317Z op = torch.compile(op) 2025-05-07T20:32:49.0040618Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0040900Z 2025-05-07T20:32:49.0041086Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.0041260Z 2025-05-07T20:32:49.0041361Z moe/activation_test.py:117: 2025-05-07T20:32:49.0041856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0042200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.0042478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0043197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.0044040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.0044595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0045311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0046001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0046557Z kernel = self.compile( 2025-05-07T20:32:49.0047116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0047802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0048210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0048447Z 2025-05-07T20:32:49.0048655Z self = 2025-05-07T20:32:49.0049784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0051211Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a0362a0>} 2025-05-07T20:32:49.0052618Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0053750Z context = 2025-05-07T20:32:49.0054046Z 2025-05-07T20:32:49.0054214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0054899Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0055381Z module_map=module_map) 2025-05-07T20:32:49.0055745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0056102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.0056361Z E ^ 2025-05-07T20:32:49.0056840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.0057311Z 2025-05-07T20:32:49.0057747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0058295Z 2025-05-07T20:32:49.0058405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0058829Z self=, 2025-05-07T20:32:49.0059241Z T=4096, 2025-05-07T20:32:49.0059428Z D=7168, 2025-05-07T20:32:49.0059619Z scale_ub=1200.0, 2025-05-07T20:32:49.0059843Z contiguous=False, 2025-05-07T20:32:49.0060058Z compiled=False, 2025-05-07T20:32:49.0060264Z ) 2025-05-07T20:32:49.0060588Z self = 2025-05-07T20:32:49.0061101Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:49.0061398Z 2025-05-07T20:32:49.0061476Z @given( 2025-05-07T20:32:49.0061708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0062018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0062325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0062661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0063113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0063419Z ) 2025-05-07T20:32:49.0063769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0064226Z def test_silu_mul_quant( 2025-05-07T20:32:49.0064540Z self, 2025-05-07T20:32:49.0064747Z T: int, 2025-05-07T20:32:49.0064949Z D: int, 2025-05-07T20:32:49.0065164Z scale_ub: Optional[float], 2025-05-07T20:32:49.0065449Z contiguous: bool, 2025-05-07T20:32:49.0065693Z compiled: bool, 2025-05-07T20:32:49.0065912Z ) -> None: 2025-05-07T20:32:49.0066128Z torch.manual_seed(2025) 2025-05-07T20:32:49.0066373Z 2025-05-07T20:32:49.0066640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0066999Z 2025-05-07T20:32:49.0067189Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0067482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0067798Z x = x_sign * x_clamp 2025-05-07T20:32:49.0068037Z x0 = x[:, :D] 2025-05-07T20:32:49.0068255Z x1 = x[:, D:] 2025-05-07T20:32:49.0068456Z 2025-05-07T20:32:49.0068636Z if contiguous: 2025-05-07T20:32:49.0068865Z x0 = x0.contiguous() 2025-05-07T20:32:49.0069127Z x1 = x1.contiguous() 2025-05-07T20:32:49.0069373Z 2025-05-07T20:32:49.0069567Z if scale_ub is not None: 2025-05-07T20:32:49.0069835Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0070169Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0070485Z ) 2025-05-07T20:32:49.0070683Z else: 2025-05-07T20:32:49.0070896Z scale_ub_tensor = None 2025-05-07T20:32:49.0071151Z 2025-05-07T20:32:49.0071373Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0071687Z op = silu_mul_quant 2025-05-07T20:32:49.0071941Z if compiled: 2025-05-07T20:32:49.0072190Z op = torch.compile(op) 2025-05-07T20:32:49.0072482Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0072759Z 2025-05-07T20:32:49.0072945Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.0073108Z 2025-05-07T20:32:49.0073204Z moe/activation_test.py:117: 2025-05-07T20:32:49.0073505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0073843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.0074119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0074833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:49.0075556Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.0076112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0076823Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0077523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0078084Z kernel = self.compile( 2025-05-07T20:32:49.0078641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0079332Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0079740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0079978Z 2025-05-07T20:32:49.0080191Z self = 2025-05-07T20:32:49.0081306Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0082849Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a0ef100>} 2025-05-07T20:32:49.0084306Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0085468Z context = 2025-05-07T20:32:49.0085769Z 2025-05-07T20:32:49.0085944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0086482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0086968Z module_map=module_map) 2025-05-07T20:32:49.0087344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0087707Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.0087972Z E ^ 2025-05-07T20:32:49.0088459Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.0088932Z 2025-05-07T20:32:49.0089377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0089929Z 2025-05-07T20:32:49.0090034Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0090458Z self=, 2025-05-07T20:32:49.0090874Z T=16384, 2025-05-07T20:32:49.0091070Z D=7168, 2025-05-07T20:32:49.0091259Z scale_ub=None, 2025-05-07T20:32:49.0091474Z contiguous=True, 2025-05-07T20:32:49.0091697Z compiled=True, 2025-05-07T20:32:49.0091895Z ) 2025-05-07T20:32:49.1832265Z self = 2025-05-07T20:32:49.1832838Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:49.1833236Z 2025-05-07T20:32:49.1833350Z @given( 2025-05-07T20:32:49.1833679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1834008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1834318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1834661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1834997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1835282Z ) 2025-05-07T20:32:49.1835635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1836094Z def test_silu_mul_quant( 2025-05-07T20:32:49.1836344Z self, 2025-05-07T20:32:49.1836542Z T: int, 2025-05-07T20:32:49.1836744Z D: int, 2025-05-07T20:32:49.1836960Z scale_ub: Optional[float], 2025-05-07T20:32:49.1837247Z contiguous: bool, 2025-05-07T20:32:49.1837495Z compiled: bool, 2025-05-07T20:32:49.1837723Z ) -> None: 2025-05-07T20:32:49.1837936Z torch.manual_seed(2025) 2025-05-07T20:32:49.1838180Z 2025-05-07T20:32:49.1838456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1838800Z 2025-05-07T20:32:49.1838995Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1839283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1839594Z x = x_sign * x_clamp 2025-05-07T20:32:49.1839826Z x0 = x[:, :D] 2025-05-07T20:32:49.1840037Z x1 = x[:, D:] 2025-05-07T20:32:49.1840238Z 2025-05-07T20:32:49.1840421Z if contiguous: 2025-05-07T20:32:49.1840649Z x0 = x0.contiguous() 2025-05-07T20:32:49.1840903Z x1 = x1.contiguous() 2025-05-07T20:32:49.1841141Z 2025-05-07T20:32:49.1841325Z if scale_ub is not None: 2025-05-07T20:32:49.1841587Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.1841923Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.1842239Z ) 2025-05-07T20:32:49.1842429Z else: 2025-05-07T20:32:49.1842797Z scale_ub_tensor = None 2025-05-07T20:32:49.1843062Z 2025-05-07T20:32:49.1843290Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.1843602Z op = silu_mul_quant 2025-05-07T20:32:49.1843964Z if compiled: 2025-05-07T20:32:49.1844210Z op = torch.compile(op) 2025-05-07T20:32:49.1844501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1844778Z 2025-05-07T20:32:49.1844972Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.1845137Z 2025-05-07T20:32:49.1845234Z moe/activation_test.py:117: 2025-05-07T20:32:49.1845530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1845867Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.1846145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1846720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.1847310Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.1847992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.1848707Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.1849263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.1849972Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.1850661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.1851214Z kernel = self.compile( 2025-05-07T20:32:49.1851770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.1852457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.1852868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1853111Z 2025-05-07T20:32:49.1853316Z self = 2025-05-07T20:32:49.1854533Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.1855958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a6a5300>} 2025-05-07T20:32:49.1857364Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.1858434Z context = 2025-05-07T20:32:49.1858734Z 2025-05-07T20:32:49.1858906Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.1859437Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.1859915Z module_map=module_map) 2025-05-07T20:32:49.1860284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.1860640Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.1860893Z E ^ 2025-05-07T20:32:49.1861359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.1861827Z 2025-05-07T20:32:49.1862261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.1862803Z 2025-05-07T20:32:49.1862905Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.1863404Z self=, 2025-05-07T20:32:49.1863813Z T=4096, 2025-05-07T20:32:49.1863998Z D=5120, 2025-05-07T20:32:49.1864187Z scale_ub=None, 2025-05-07T20:32:49.1864393Z contiguous=False, 2025-05-07T20:32:49.1864640Z compiled=True, 2025-05-07T20:32:49.1864914Z ) 2025-05-07T20:32:49.1865236Z self = 2025-05-07T20:32:49.1865737Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.1866024Z 2025-05-07T20:32:49.1866101Z @given( 2025-05-07T20:32:49.1866324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1866635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1866942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1867269Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1867599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1867881Z ) 2025-05-07T20:32:49.1868238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1868687Z def test_silu_mul_quant( 2025-05-07T20:32:49.1868920Z self, 2025-05-07T20:32:49.1869112Z T: int, 2025-05-07T20:32:49.1869308Z D: int, 2025-05-07T20:32:49.1869522Z scale_ub: Optional[float], 2025-05-07T20:32:49.1869793Z contiguous: bool, 2025-05-07T20:32:49.1870030Z compiled: bool, 2025-05-07T20:32:49.1870242Z ) -> None: 2025-05-07T20:32:49.1870450Z torch.manual_seed(2025) 2025-05-07T20:32:49.1870689Z 2025-05-07T20:32:49.1870956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1871312Z 2025-05-07T20:32:49.1871508Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1871797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1872107Z x = x_sign * x_clamp 2025-05-07T20:32:49.1872356Z x0 = x[:, :D] 2025-05-07T20:32:49.1872571Z x1 = x[:, D:] 2025-05-07T20:32:49.1872782Z 2025-05-07T20:32:49.1872969Z if contiguous: 2025-05-07T20:32:49.1873204Z x0 = x0.contiguous() 2025-05-07T20:32:49.1873485Z x1 = x1.contiguous() 2025-05-07T20:32:49.1873758Z 2025-05-07T20:32:49.1873952Z if scale_ub is not None: 2025-05-07T20:32:49.1874231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.1874575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.1874895Z ) 2025-05-07T20:32:49.1875095Z else: 2025-05-07T20:32:49.1875311Z scale_ub_tensor = None 2025-05-07T20:32:49.1875573Z 2025-05-07T20:32:49.1875802Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.1876120Z op = silu_mul_quant 2025-05-07T20:32:49.1876370Z if compiled: 2025-05-07T20:32:49.1876617Z op = torch.compile(op) 2025-05-07T20:32:49.1876910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1877191Z 2025-05-07T20:32:49.1877385Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.1877548Z 2025-05-07T20:32:49.1877648Z moe/activation_test.py:117: 2025-05-07T20:32:49.1877948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1878297Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.1878575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1879156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.1879743Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.1880425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.1881153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.1881708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.1882504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.1883246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.1883809Z kernel = self.compile( 2025-05-07T20:32:49.1884472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.1885159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.1885572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1885820Z 2025-05-07T20:32:49.1886031Z self = 2025-05-07T20:32:49.1893566Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.1895094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ac44360>} 2025-05-07T20:32:49.1896515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.1897607Z context = 2025-05-07T20:32:49.1897911Z 2025-05-07T20:32:49.1898087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.1898632Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.1899122Z module_map=module_map) 2025-05-07T20:32:49.1899490Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.1899852Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.1900128Z E ^ 2025-05-07T20:32:49.1900606Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.1901078Z 2025-05-07T20:32:49.1901515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.1902067Z 2025-05-07T20:32:49.3350147Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.3350599Z self=, 2025-05-07T20:32:49.3351176Z T=4096, 2025-05-07T20:32:49.3351445Z D=5120, 2025-05-07T20:32:49.3351648Z scale_ub=1200.0, 2025-05-07T20:32:49.3351872Z contiguous=False, 2025-05-07T20:32:49.3352098Z compiled=False, 2025-05-07T20:32:49.3352311Z ) 2025-05-07T20:32:49.3352626Z self = 2025-05-07T20:32:49.3353138Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:49.3353435Z 2025-05-07T20:32:49.3353517Z @given( 2025-05-07T20:32:49.3353741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.3354061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.3354375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.3354716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.3355050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.3355344Z ) 2025-05-07T20:32:49.3355701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.3356151Z def test_silu_mul_quant( 2025-05-07T20:32:49.3356399Z self, 2025-05-07T20:32:49.3356596Z T: int, 2025-05-07T20:32:49.3356792Z D: int, 2025-05-07T20:32:49.3357014Z scale_ub: Optional[float], 2025-05-07T20:32:49.3357290Z contiguous: bool, 2025-05-07T20:32:49.3357530Z compiled: bool, 2025-05-07T20:32:49.3357755Z ) -> None: 2025-05-07T20:32:49.3358148Z torch.manual_seed(2025) 2025-05-07T20:32:49.3358400Z 2025-05-07T20:32:49.3358677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.3359039Z 2025-05-07T20:32:49.3359241Z x_sign = torch.sign(x) 2025-05-07T20:32:49.3359648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.3359973Z x = x_sign * x_clamp 2025-05-07T20:32:49.3360218Z x0 = x[:, :D] 2025-05-07T20:32:49.3360436Z x1 = x[:, D:] 2025-05-07T20:32:49.3360647Z 2025-05-07T20:32:49.3360836Z if contiguous: 2025-05-07T20:32:49.3361068Z x0 = x0.contiguous() 2025-05-07T20:32:49.3361342Z x1 = x1.contiguous() 2025-05-07T20:32:49.3361595Z 2025-05-07T20:32:49.3361791Z if scale_ub is not None: 2025-05-07T20:32:49.3362074Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.3362415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.3362733Z ) 2025-05-07T20:32:49.3362931Z else: 2025-05-07T20:32:49.3363147Z scale_ub_tensor = None 2025-05-07T20:32:49.3363415Z 2025-05-07T20:32:49.3363696Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.3364031Z op = silu_mul_quant 2025-05-07T20:32:49.3364292Z if compiled: 2025-05-07T20:32:49.3364546Z op = torch.compile(op) 2025-05-07T20:32:49.3364852Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3365140Z 2025-05-07T20:32:49.3365334Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.3365504Z 2025-05-07T20:32:49.3365604Z moe/activation_test.py:117: 2025-05-07T20:32:49.3365906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3366241Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.3366532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3367265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:49.3367999Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.3368557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.3369280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.3369978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.3370535Z kernel = self.compile( 2025-05-07T20:32:49.3371105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.3371795Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.3372208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3372444Z 2025-05-07T20:32:49.3372661Z self = 2025-05-07T20:32:49.3373836Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.3375374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ac46700>} 2025-05-07T20:32:49.3376791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.3377879Z context = 2025-05-07T20:32:49.3378179Z 2025-05-07T20:32:49.3378349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.3378978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.3379466Z module_map=module_map) 2025-05-07T20:32:49.3379841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3380282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.3380553Z E ^ 2025-05-07T20:32:49.3381032Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3381508Z 2025-05-07T20:32:49.3381946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.3382492Z 2025-05-07T20:32:49.3382597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.3383025Z self=, 2025-05-07T20:32:49.3383463Z T=4096, 2025-05-07T20:32:49.3383686Z D=5120, 2025-05-07T20:32:49.3383888Z scale_ub=1200.0, 2025-05-07T20:32:49.3384114Z contiguous=False, 2025-05-07T20:32:49.3384345Z compiled=True, 2025-05-07T20:32:49.3384556Z ) 2025-05-07T20:32:49.3384886Z self = 2025-05-07T20:32:49.3385402Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:49.3385691Z 2025-05-07T20:32:49.3385771Z @given( 2025-05-07T20:32:49.3386005Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.3386324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.3386649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.3386986Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.3387321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.3387611Z ) 2025-05-07T20:32:49.3387969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.3388421Z def test_silu_mul_quant( 2025-05-07T20:32:49.3388666Z self, 2025-05-07T20:32:49.3388866Z T: int, 2025-05-07T20:32:49.3389059Z D: int, 2025-05-07T20:32:49.3389274Z scale_ub: Optional[float], 2025-05-07T20:32:49.3389551Z contiguous: bool, 2025-05-07T20:32:49.3389795Z compiled: bool, 2025-05-07T20:32:49.3390016Z ) -> None: 2025-05-07T20:32:49.3390232Z torch.manual_seed(2025) 2025-05-07T20:32:49.3390477Z 2025-05-07T20:32:49.3390746Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.3391104Z 2025-05-07T20:32:49.3391297Z x_sign = torch.sign(x) 2025-05-07T20:32:49.3391584Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.3391899Z x = x_sign * x_clamp 2025-05-07T20:32:49.3392136Z x0 = x[:, :D] 2025-05-07T20:32:49.3392347Z x1 = x[:, D:] 2025-05-07T20:32:49.3392556Z 2025-05-07T20:32:49.3392743Z if contiguous: 2025-05-07T20:32:49.3392974Z x0 = x0.contiguous() 2025-05-07T20:32:49.3393252Z x1 = x1.contiguous() 2025-05-07T20:32:49.3393533Z 2025-05-07T20:32:49.3393750Z if scale_ub is not None: 2025-05-07T20:32:49.3394024Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.3394371Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.3394689Z ) 2025-05-07T20:32:49.3394879Z else: 2025-05-07T20:32:49.3395089Z scale_ub_tensor = None 2025-05-07T20:32:49.3395344Z 2025-05-07T20:32:49.3395571Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.3395893Z op = silu_mul_quant 2025-05-07T20:32:49.3396149Z if compiled: 2025-05-07T20:32:49.3396393Z op = torch.compile(op) 2025-05-07T20:32:49.3396691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3396970Z 2025-05-07T20:32:49.3397160Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.3397327Z 2025-05-07T20:32:49.3397510Z moe/activation_test.py:117: 2025-05-07T20:32:49.3397811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3398153Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.3398434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3399085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.3399673Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.3400354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.3401073Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.3401634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.3402346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.3403041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.3403651Z kernel = self.compile( 2025-05-07T20:32:49.3404211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.3404899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.3405311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3405553Z 2025-05-07T20:32:49.3405763Z self = 2025-05-07T20:32:49.3406881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.3408310Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b37b2e0>} 2025-05-07T20:32:49.3409710Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.3410796Z context = 2025-05-07T20:32:49.3411100Z 2025-05-07T20:32:49.3411271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.3411805Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.3412283Z module_map=module_map) 2025-05-07T20:32:49.3412661Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3413023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.3413287Z E ^ 2025-05-07T20:32:49.3413817Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3414289Z 2025-05-07T20:32:49.3414813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.3415359Z 2025-05-07T20:32:49.4547452Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.4547933Z self=, 2025-05-07T20:32:49.4548543Z T=2048, 2025-05-07T20:32:49.4548800Z D=7168, 2025-05-07T20:32:49.4549045Z scale_ub=1200.0, 2025-05-07T20:32:49.4549325Z contiguous=False, 2025-05-07T20:32:49.4549615Z compiled=False, 2025-05-07T20:32:49.4549865Z ) 2025-05-07T20:32:49.4550194Z self = 2025-05-07T20:32:49.4550718Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:49.4551014Z 2025-05-07T20:32:49.4551098Z @given( 2025-05-07T20:32:49.4551509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.4551834Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.4552150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.4552485Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.4552962Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.4553306Z ) 2025-05-07T20:32:49.4553663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.4554117Z def test_silu_mul_quant( 2025-05-07T20:32:49.4554364Z self, 2025-05-07T20:32:49.4554560Z T: int, 2025-05-07T20:32:49.4554756Z D: int, 2025-05-07T20:32:49.4554979Z scale_ub: Optional[float], 2025-05-07T20:32:49.4555259Z contiguous: bool, 2025-05-07T20:32:49.4555504Z compiled: bool, 2025-05-07T20:32:49.4555724Z ) -> None: 2025-05-07T20:32:49.4555951Z torch.manual_seed(2025) 2025-05-07T20:32:49.4556197Z 2025-05-07T20:32:49.4556477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.4556831Z 2025-05-07T20:32:49.4557029Z x_sign = torch.sign(x) 2025-05-07T20:32:49.4557319Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.4557643Z x = x_sign * x_clamp 2025-05-07T20:32:49.4557886Z x0 = x[:, :D] 2025-05-07T20:32:49.4558106Z x1 = x[:, D:] 2025-05-07T20:32:49.4558315Z 2025-05-07T20:32:49.4558506Z if contiguous: 2025-05-07T20:32:49.4558737Z x0 = x0.contiguous() 2025-05-07T20:32:49.4559000Z x1 = x1.contiguous() 2025-05-07T20:32:49.4559242Z 2025-05-07T20:32:49.4559430Z if scale_ub is not None: 2025-05-07T20:32:49.4559705Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.4560041Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.4560351Z ) 2025-05-07T20:32:49.4560552Z else: 2025-05-07T20:32:49.4560773Z scale_ub_tensor = None 2025-05-07T20:32:49.4561038Z 2025-05-07T20:32:49.4561272Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.4561592Z op = silu_mul_quant 2025-05-07T20:32:49.4561845Z if compiled: 2025-05-07T20:32:49.4562126Z op = torch.compile(op) 2025-05-07T20:32:49.4562419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.4562701Z 2025-05-07T20:32:49.4562899Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.4563088Z 2025-05-07T20:32:49.4563199Z moe/activation_test.py:117: 2025-05-07T20:32:49.4563518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.4563865Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.4564146Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.4564864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:49.4565598Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.4566156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.4566869Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.4567571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.4568131Z kernel = self.compile( 2025-05-07T20:32:49.4568691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.4569378Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.4569781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.4570016Z 2025-05-07T20:32:49.4570231Z self = 2025-05-07T20:32:49.4571432Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.4572867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129592a340>} 2025-05-07T20:32:49.4574352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.4575609Z context = 2025-05-07T20:32:49.4575911Z 2025-05-07T20:32:49.4576085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.4576626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.4577120Z module_map=module_map) 2025-05-07T20:32:49.4577503Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.4577872Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.4578141Z E ^ 2025-05-07T20:32:49.4578633Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.4579110Z 2025-05-07T20:32:49.4579552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.4580095Z 2025-05-07T20:32:49.4580201Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.4580627Z self=, 2025-05-07T20:32:49.4581048Z T=1, 2025-05-07T20:32:49.4581241Z D=7168, 2025-05-07T20:32:49.4581437Z scale_ub=None, 2025-05-07T20:32:49.4581659Z contiguous=True, 2025-05-07T20:32:49.4581889Z compiled=False, 2025-05-07T20:32:49.4582101Z ) 2025-05-07T20:32:49.4582431Z self = 2025-05-07T20:32:49.4582939Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:49.4583210Z 2025-05-07T20:32:49.4583297Z @given( 2025-05-07T20:32:49.4583530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.4583856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.4584175Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.4584514Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.4584857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.4585157Z ) 2025-05-07T20:32:49.4585512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.4585975Z def test_silu_mul_quant( 2025-05-07T20:32:49.4586226Z self, 2025-05-07T20:32:49.4586427Z T: int, 2025-05-07T20:32:49.4586633Z D: int, 2025-05-07T20:32:49.4586859Z scale_ub: Optional[float], 2025-05-07T20:32:49.4587131Z contiguous: bool, 2025-05-07T20:32:49.4587372Z compiled: bool, 2025-05-07T20:32:49.4587593Z ) -> None: 2025-05-07T20:32:49.4587804Z torch.manual_seed(2025) 2025-05-07T20:32:49.4588054Z 2025-05-07T20:32:49.4588332Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.4588680Z 2025-05-07T20:32:49.4588883Z x_sign = torch.sign(x) 2025-05-07T20:32:49.4589177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.4589490Z x = x_sign * x_clamp 2025-05-07T20:32:49.4589731Z x0 = x[:, :D] 2025-05-07T20:32:49.4589952Z x1 = x[:, D:] 2025-05-07T20:32:49.4590166Z 2025-05-07T20:32:49.4590345Z if contiguous: 2025-05-07T20:32:49.4590575Z x0 = x0.contiguous() 2025-05-07T20:32:49.4590838Z x1 = x1.contiguous() 2025-05-07T20:32:49.4591080Z 2025-05-07T20:32:49.4591357Z if scale_ub is not None: 2025-05-07T20:32:49.4591637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.4591967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.4592279Z ) 2025-05-07T20:32:49.4592550Z else: 2025-05-07T20:32:49.4592757Z scale_ub_tensor = None 2025-05-07T20:32:49.4593011Z 2025-05-07T20:32:49.4593240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.4593579Z op = silu_mul_quant 2025-05-07T20:32:49.4593858Z if compiled: 2025-05-07T20:32:49.4594106Z op = torch.compile(op) 2025-05-07T20:32:49.4594403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.4594684Z 2025-05-07T20:32:49.4594876Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.4595043Z 2025-05-07T20:32:49.4595143Z moe/activation_test.py:117: 2025-05-07T20:32:49.4595439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.4595784Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.4596072Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.4596786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.4597523Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.4598079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.4598792Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.4599484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.4600042Z kernel = self.compile( 2025-05-07T20:32:49.4600606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.4601293Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.4601704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.4601944Z 2025-05-07T20:32:49.4602153Z self = 2025-05-07T20:32:49.4603303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.4604743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295a031a0>} 2025-05-07T20:32:49.4606155Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.4607244Z context = 2025-05-07T20:32:49.4607543Z 2025-05-07T20:32:49.4607717Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.4608256Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.4608741Z module_map=module_map) 2025-05-07T20:32:49.4609110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.4609471Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.4609731Z E ^ 2025-05-07T20:32:49.4610208Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)

self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
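Since the CompilationError is purely an architecture mismatch, one way to keep this suite green on pre-Ada runners would be to skip the FP8 path up front instead of letting kernel compilation fail mid-test. A minimal sketch, assuming the (8, 9) capability threshold holds for the installed Triton; supports_fp8e4nv and Fp8SkipExample are illustrative names, not FBGEMM code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Hypothetical guard: fp8e4nv (FP8 E4M3) is assumed to need
        # compute capability >= 8.9 (e.g. L4, H100).
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )


    class Fp8SkipExample(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
        def test_silu_mul_quant_fp8(self) -> None:
            # The FP8 test body would run here; on SM < 8.9 the test is
            # reported as skipped rather than failing as above.
            pass


    if __name__ == "__main__":
        unittest.main()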
Hypothesis goes on to retry the test with the remaining examples. Every one of them fails at the same point, while Triton compiles _fbgemm_silu_mul_quant, with the identical CompilationError raised from triton/compiler/compiler.py:100, so only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.7645532Z 2025-05-07T20:32:50.7646215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.7647059Z 2025-05-07T20:32:50.8862825Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8863699Z self=, 2025-05-07T20:32:50.8864423Z T=16384, 2025-05-07T20:32:50.8864753Z D=5120, 2025-05-07T20:32:50.8865083Z scale_ub=1200.0, 2025-05-07T20:32:50.8865464Z contiguous=True, 2025-05-07T20:32:50.8865844Z compiled=False, 2025-05-07T20:32:50.8866195Z ) 2025-05-07T20:32:50.8866741Z self = 2025-05-07T20:32:50.8867637Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.8868150Z 2025-05-07T20:32:50.8868277Z @given( 2025-05-07T20:32:50.8868667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8869207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8869741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8870325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8870908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8871417Z ) 2025-05-07T20:32:50.8872037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8872827Z def test_silu_mul_quant( 2025-05-07T20:32:50.8873254Z self, 2025-05-07T20:32:50.8873632Z T: int, 2025-05-07T20:32:50.8873955Z D: int, 2025-05-07T20:32:50.8874304Z scale_ub: Optional[float], 2025-05-07T20:32:50.8874746Z contiguous: bool, 2025-05-07T20:32:50.8875139Z compiled: bool, 2025-05-07T20:32:50.8875494Z ) -> None: 2025-05-07T20:32:50.8875850Z torch.manual_seed(2025) 2025-05-07T20:32:50.8876243Z 2025-05-07T20:32:50.8876679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8877268Z 2025-05-07T20:32:50.8877597Z x_sign = torch.sign(x) 2025-05-07T20:32:50.8878088Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.8878635Z x = x_sign * x_clamp 2025-05-07T20:32:50.8879469Z x0 = x[:, :D] 2025-05-07T20:32:50.8879837Z x1 = x[:, D:] 2025-05-07T20:32:50.8880198Z 2025-05-07T20:32:50.8880515Z if contiguous: 2025-05-07T20:32:50.8880900Z x0 = x0.contiguous() 2025-05-07T20:32:50.8881589Z x1 = x1.contiguous() 2025-05-07T20:32:50.8882001Z 2025-05-07T20:32:50.8882330Z if scale_ub is not None: 2025-05-07T20:32:50.8882800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.8883369Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.8883913Z ) 2025-05-07T20:32:50.8884246Z else: 2025-05-07T20:32:50.8884594Z scale_ub_tensor = None 2025-05-07T20:32:50.8885036Z 2025-05-07T20:32:50.8885430Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.8885978Z op = silu_mul_quant 2025-05-07T20:32:50.8886411Z if compiled: 2025-05-07T20:32:50.8886837Z op = torch.compile(op) 2025-05-07T20:32:50.8887371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8887848Z 2025-05-07T20:32:50.8888173Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.8888460Z 2025-05-07T20:32:50.8888638Z moe/activation_test.py:117: 2025-05-07T20:32:50.8889159Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8889749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.8890241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8891504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:50.8892788Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.8893825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.8895225Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.8896366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.8897120Z kernel = self.compile( 2025-05-07T20:32:50.8897870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.8898809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8899358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8899685Z 2025-05-07T20:32:50.8899964Z self = 2025-05-07T20:32:50.8901536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.8903648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ad1c720>} 2025-05-07T20:32:50.8905693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.8907238Z context = 2025-05-07T20:32:50.8907674Z 2025-05-07T20:32:50.8907904Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.8908702Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8909441Z module_map=module_map) 2025-05-07T20:32:50.8910008Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8910559Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8910955Z E ^ 2025-05-07T20:32:50.8911842Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8912593Z 2025-05-07T20:32:50.8913272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8914213Z 2025-05-07T20:32:50.8914381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8915024Z self=, 2025-05-07T20:32:50.8915656Z T=1, 2025-05-07T20:32:50.8915941Z D=7168, 2025-05-07T20:32:50.8916234Z scale_ub=1200.0, 2025-05-07T20:32:50.8916581Z contiguous=False, 2025-05-07T20:32:50.8916929Z compiled=False, 2025-05-07T20:32:50.8917235Z ) 2025-05-07T20:32:50.8917731Z self = 2025-05-07T20:32:50.8918508Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.8918933Z 2025-05-07T20:32:50.8919057Z @given( 2025-05-07T20:32:50.8919404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8919893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8920372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8920880Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8921402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8921854Z ) 2025-05-07T20:32:50.8922394Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8923100Z def test_silu_mul_quant( 2025-05-07T20:32:50.8923483Z self, 2025-05-07T20:32:50.8923781Z T: int, 2025-05-07T20:32:50.8924075Z D: int, 2025-05-07T20:32:50.8924409Z scale_ub: Optional[float], 2025-05-07T20:32:50.8924833Z contiguous: bool, 2025-05-07T20:32:50.8925172Z compiled: bool, 2025-05-07T20:32:50.8925739Z ) -> None: 2025-05-07T20:32:50.8926075Z torch.manual_seed(2025) 2025-05-07T20:32:50.8926426Z 2025-05-07T20:32:50.8926865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8927410Z 2025-05-07T20:32:50.8927710Z x_sign = torch.sign(x) 2025-05-07T20:32:50.8928179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.8928679Z x = x_sign * x_clamp 2025-05-07T20:32:50.8929042Z x0 = x[:, :D] 2025-05-07T20:32:50.8929397Z x1 = x[:, D:] 2025-05-07T20:32:50.8929717Z 2025-05-07T20:32:50.8929988Z if contiguous: 2025-05-07T20:32:50.8930358Z x0 = x0.contiguous() 2025-05-07T20:32:50.8930802Z x1 = x1.contiguous() 2025-05-07T20:32:50.8931212Z 2025-05-07T20:32:50.8931543Z if scale_ub is not None: 2025-05-07T20:32:50.8932024Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.8932610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.8933145Z ) 2025-05-07T20:32:50.8933474Z else: 2025-05-07T20:32:50.8933850Z scale_ub_tensor = None 2025-05-07T20:32:50.8934283Z 2025-05-07T20:32:50.8934755Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.8935324Z op = silu_mul_quant 2025-05-07T20:32:50.8935756Z if compiled: 2025-05-07T20:32:50.8936195Z op = torch.compile(op) 2025-05-07T20:32:50.8936711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8937186Z 2025-05-07T20:32:50.8937512Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.8937800Z 2025-05-07T20:32:50.8937984Z moe/activation_test.py:117: 2025-05-07T20:32:50.8938492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8939083Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.8939568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8940839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.8942367Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.8943376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.8944678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.8946054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.8947031Z kernel = self.compile( 2025-05-07T20:32:50.8948010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.8961172Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8961928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8962350Z 2025-05-07T20:32:50.8962717Z self = 2025-05-07T20:32:50.8964718Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.8967276Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b2f05e0>} 2025-05-07T20:32:50.8969797Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.8971703Z context = 2025-05-07T20:32:50.8972227Z 2025-05-07T20:32:50.8972516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.8973465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8974320Z module_map=module_map) 2025-05-07T20:32:50.8975050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8975659Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8976123Z E ^ 2025-05-07T20:32:50.8976945Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8977783Z 2025-05-07T20:32:50.8978550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8979524Z 2025-05-07T20:32:51.0715373Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.0716192Z self=, 2025-05-07T20:32:51.0716877Z T=4096, 2025-05-07T20:32:51.0717189Z D=7168, 2025-05-07T20:32:51.0717501Z scale_ub=1200.0, 2025-05-07T20:32:51.0717869Z contiguous=False, 2025-05-07T20:32:51.0718280Z compiled=True, 2025-05-07T20:32:51.0718608Z ) 2025-05-07T20:32:51.0719131Z self = 2025-05-07T20:32:51.0719952Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.0720398Z 2025-05-07T20:32:51.0720526Z @given( 2025-05-07T20:32:51.0720876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.0721394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.0721898Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.0722436Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.0722976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.0723457Z ) 2025-05-07T20:32:51.0724044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.0724796Z def test_silu_mul_quant( 2025-05-07T20:32:51.0725199Z self, 2025-05-07T20:32:51.0725792Z T: int, 2025-05-07T20:32:51.0727071Z D: int, 2025-05-07T20:32:51.0727448Z scale_ub: Optional[float], 2025-05-07T20:32:51.0727889Z contiguous: bool, 2025-05-07T20:32:51.0728284Z compiled: bool, 2025-05-07T20:32:51.0728650Z ) -> None: 2025-05-07T20:32:51.0729242Z torch.manual_seed(2025) 2025-05-07T20:32:51.0729634Z 2025-05-07T20:32:51.0730068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.0730639Z 2025-05-07T20:32:51.0730945Z x_sign = torch.sign(x) 2025-05-07T20:32:51.0731418Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.0731930Z x = x_sign * x_clamp 2025-05-07T20:32:51.0732313Z x0 = x[:, :D] 2025-05-07T20:32:51.0732664Z x1 = x[:, D:] 2025-05-07T20:32:51.0733003Z 2025-05-07T20:32:51.0733292Z if contiguous: 2025-05-07T20:32:51.0733651Z x0 = x0.contiguous() 2025-05-07T20:32:51.0734075Z x1 = x1.contiguous() 2025-05-07T20:32:51.0734645Z 2025-05-07T20:32:51.0734966Z if scale_ub is not None: 2025-05-07T20:32:51.0735424Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.0735967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.0736437Z ) 2025-05-07T20:32:51.0736701Z else: 2025-05-07T20:32:51.0736973Z scale_ub_tensor = None 2025-05-07T20:32:51.0737331Z 2025-05-07T20:32:51.0737652Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.0738095Z op = silu_mul_quant 2025-05-07T20:32:51.0738444Z if compiled: 2025-05-07T20:32:51.0738798Z op = torch.compile(op) 2025-05-07T20:32:51.0739235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0739639Z 2025-05-07T20:32:51.0739935Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.0740191Z 2025-05-07T20:32:51.0740355Z moe/activation_test.py:117: 2025-05-07T20:32:51.0740822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0741327Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.0741766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0742640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.0743619Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.0744698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.0745867Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.0746771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.0748005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.0749230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.0750199Z kernel = self.compile( 2025-05-07T20:32:51.0751171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.0752354Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.0753042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0753413Z 2025-05-07T20:32:51.0753780Z self = 2025-05-07T20:32:51.0755609Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.0757926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0aafc0>} 2025-05-07T20:32:51.0760396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.0762167Z context = 2025-05-07T20:32:51.0762775Z 2025-05-07T20:32:51.0763068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.0763986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.0764740Z module_map=module_map) 2025-05-07T20:32:51.0765268Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.0765848Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.0766304Z E ^ 2025-05-07T20:32:51.0767151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.0767986Z 2025-05-07T20:32:51.0768760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.0769735Z 2025-05-07T20:32:51.0769914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.0770650Z self=, 2025-05-07T20:32:51.0771384Z T=128, 2025-05-07T20:32:51.0771688Z D=7168, 2025-05-07T20:32:51.0772021Z scale_ub=1200.0, 2025-05-07T20:32:51.0772404Z contiguous=False, 2025-05-07T20:32:51.0772780Z compiled=True, 2025-05-07T20:32:51.0773124Z ) 2025-05-07T20:32:51.1696198Z self = 2025-05-07T20:32:51.1697171Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.1697673Z 2025-05-07T20:32:51.1697811Z @given( 2025-05-07T20:32:51.1698194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.1698752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.1699312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.1699904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.1700480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.1700994Z ) 2025-05-07T20:32:51.1701610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.1702393Z def test_silu_mul_quant( 2025-05-07T20:32:51.1702811Z self, 2025-05-07T20:32:51.1703135Z T: int, 2025-05-07T20:32:51.1703477Z D: int, 2025-05-07T20:32:51.1703873Z scale_ub: Optional[float], 2025-05-07T20:32:51.1704347Z contiguous: bool, 2025-05-07T20:32:51.1704748Z compiled: bool, 2025-05-07T20:32:51.1705133Z ) -> None: 2025-05-07T20:32:51.1705498Z torch.manual_seed(2025) 2025-05-07T20:32:51.1705910Z 2025-05-07T20:32:51.1706378Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.1706986Z 2025-05-07T20:32:51.1707306Z x_sign = torch.sign(x) 2025-05-07T20:32:51.1707782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.1708291Z x = x_sign * x_clamp 2025-05-07T20:32:51.1708685Z x0 = x[:, :D] 2025-05-07T20:32:51.1709030Z x1 = x[:, D:] 2025-05-07T20:32:51.1709368Z 2025-05-07T20:32:51.1709665Z if contiguous: 2025-05-07T20:32:51.1710027Z x0 = x0.contiguous() 2025-05-07T20:32:51.1710456Z x1 = x1.contiguous() 2025-05-07T20:32:51.1710871Z 2025-05-07T20:32:51.1711185Z if scale_ub is not None: 2025-05-07T20:32:51.1711652Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.1712228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.1712752Z ) 2025-05-07T20:32:51.1713078Z else: 2025-05-07T20:32:51.1713426Z scale_ub_tensor = None 2025-05-07T20:32:51.1713854Z 2025-05-07T20:32:51.1714658Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.1715210Z op = silu_mul_quant 2025-05-07T20:32:51.1715634Z if compiled: 2025-05-07T20:32:51.1716059Z op = torch.compile(op) 2025-05-07T20:32:51.1716563Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1717269Z 2025-05-07T20:32:51.1717587Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.1717881Z 2025-05-07T20:32:51.1718053Z moe/activation_test.py:117: 2025-05-07T20:32:51.1718568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1719147Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.1719639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1720661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.1721679Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.1722899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.1724178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.1725156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.1726765Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.1727988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.1728968Z kernel = self.compile( 2025-05-07T20:32:51.1729933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.1730859Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.1731417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1731737Z 2025-05-07T20:32:51.1732048Z self = 2025-05-07T20:32:51.1733562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.1735750Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129675d800>} 2025-05-07T20:32:51.1737791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.1739334Z context = 2025-05-07T20:32:51.1739778Z 2025-05-07T20:32:51.1740060Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.1740866Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.1741606Z module_map=module_map) 2025-05-07T20:32:51.1742158Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.1742628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.1742993Z E ^ 2025-05-07T20:32:51.1743692Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.1744351Z 2025-05-07T20:32:51.1744976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.1745740Z 2025-05-07T20:32:51.1745887Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.1746487Z self=, 2025-05-07T20:32:51.1747080Z T=2048, 2025-05-07T20:32:51.1747336Z D=7168, 2025-05-07T20:32:51.1747607Z scale_ub=None, 2025-05-07T20:32:51.1748109Z contiguous=True, 2025-05-07T20:32:51.1748425Z compiled=True, 2025-05-07T20:32:51.1748716Z ) 2025-05-07T20:32:51.1749173Z self = 2025-05-07T20:32:51.1749879Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.1750436Z 2025-05-07T20:32:51.1750544Z @given( 2025-05-07T20:32:51.1750865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.1751314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.1751746Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.1752221Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.1752694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.1753098Z ) 2025-05-07T20:32:51.1753621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.1754296Z def test_silu_mul_quant( 2025-05-07T20:32:51.1754645Z self, 2025-05-07T20:32:51.1754919Z T: int, 2025-05-07T20:32:51.1755208Z D: int, 2025-05-07T20:32:51.1755516Z scale_ub: Optional[float], 2025-05-07T20:32:51.1755894Z contiguous: bool, 2025-05-07T20:32:51.1756239Z compiled: bool, 2025-05-07T20:32:51.1756564Z ) -> None: 2025-05-07T20:32:51.1756857Z torch.manual_seed(2025) 2025-05-07T20:32:51.1757205Z 2025-05-07T20:32:51.1757587Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.1758073Z 2025-05-07T20:32:51.1758347Z x_sign = torch.sign(x) 2025-05-07T20:32:51.1758756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.1759191Z x = x_sign * x_clamp 2025-05-07T20:32:51.1759533Z x0 = x[:, :D] 2025-05-07T20:32:51.1759842Z x1 = x[:, D:] 2025-05-07T20:32:51.1760127Z 2025-05-07T20:32:51.1760403Z if contiguous: 2025-05-07T20:32:51.1760736Z x0 = x0.contiguous() 2025-05-07T20:32:51.1761106Z x1 = x1.contiguous() 2025-05-07T20:32:51.1761468Z 2025-05-07T20:32:51.1761740Z if scale_ub is not None: 2025-05-07T20:32:51.1762118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.1762589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.1763038Z ) 2025-05-07T20:32:51.1763306Z else: 2025-05-07T20:32:51.1763591Z scale_ub_tensor = None 2025-05-07T20:32:51.1763947Z 2025-05-07T20:32:51.1764271Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.1764708Z op = silu_mul_quant 2025-05-07T20:32:51.1765069Z if compiled: 2025-05-07T20:32:51.1765422Z op = torch.compile(op) 2025-05-07T20:32:51.1765835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1766231Z 2025-05-07T20:32:51.1766501Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.1766733Z 2025-05-07T20:32:51.1766869Z moe/activation_test.py:117: 2025-05-07T20:32:51.1767299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1767772Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.1768167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1768977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.1769812Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.1770784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.1771805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.1772597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.1773651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.1774848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.1775638Z kernel = self.compile( 2025-05-07T20:32:51.1776429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.1777396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.1778056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1778404Z 2025-05-07T20:32:51.1778689Z self = 2025-05-07T20:32:51.1780268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.1782292Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1278524400>} 2025-05-07T20:32:51.1784282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.1785752Z context = 2025-05-07T20:32:51.1786190Z 2025-05-07T20:32:51.1786424Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.1787190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.1787872Z module_map=module_map) 2025-05-07T20:32:51.1788390Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.1788907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.1789294Z E ^ 2025-05-07T20:32:51.1789995Z E ValueError("type fp8e4nv not supported in this architecture. 
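The CompilationError repeated above is an architecture mismatch rather than a kernel bug: Triton's fp8e4nv type corresponds to float8_e4m3fn, which Triton only compiles for GPUs of compute capability 8.9 or newer, and the dtype list it prints ('fp8e4b15', 'fp8e5') is what it offers on older parts. Below is a minimal sketch of a capability guard such a test could use to skip fp8e4nv cases on unsupported hardware; the helper name and decorator placement are illustrative assumptions, not code from the test file above:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) requires compute capability >= (8, 9)
        # (Ada/Hopper); older GPUs trigger the CompilationError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantFp8Test(unittest.TestCase):
        ...  # fp8-dependent cases such as test_silu_mul_quant would live here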
2025-05-07T20:32:51.2402863Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:51.2403637Z     self=,
2025-05-07T20:32:51.2404368Z     T=16384,
2025-05-07T20:32:51.2404674Z     D=5120,
2025-05-07T20:32:51.2404977Z     scale_ub=None,
2025-05-07T20:32:51.2405321Z     contiguous=False,
2025-05-07T20:32:51.2405673Z     compiled=False,
2025-05-07T20:32:51.2406003Z )
2025-05-07T20:32:51.2406514Z self = 
2025-05-07T20:32:51.2407349Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:51.2407847Z 
2025-05-07T20:32:51.2407975Z     @given(
2025-05-07T20:32:51.2408342Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:51.2408855Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:51.2409358Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:51.2409917Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:51.2410471Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:51.2410942Z     )
2025-05-07T20:32:51.2411496Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:51.2412244Z     def test_silu_mul_quant(
2025-05-07T20:32:51.2412625Z         self,
2025-05-07T20:32:51.2412940Z         T: int,
2025-05-07T20:32:51.2413261Z         D: int,
2025-05-07T20:32:51.2413604Z         scale_ub: Optional[float],
2025-05-07T20:32:51.2414062Z         contiguous: bool,
2025-05-07T20:32:51.2414634Z         compiled: bool,
2025-05-07T20:32:51.2415004Z     ) -> None:
2025-05-07T20:32:51.2415363Z         torch.manual_seed(2025)
2025-05-07T20:32:51.2415771Z 
2025-05-07T20:32:51.2416588Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:51.2417205Z 
2025-05-07T20:32:51.2417535Z         x_sign = torch.sign(x)
2025-05-07T20:32:51.2418023Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:51.2421759Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:51.2425364Z 
2025-05-07T20:32:51.2425935Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:51.2426293Z 
The remaining out-of-memory examples fail the same way; only the failing line and the requested allocation differ:
2025-05-07T20:32:51.2426455Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 112.00 MiB)
2025-05-07T20:32:51.2449024Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 448.00 MiB)
2025-05-07T20:32:51.2471168Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 56.00 MiB)
2025-05-07T20:32:51.2504445Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -- OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 56.00 MiB)
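Note that these out-of-memory failures hit small allocations (56-448 MiB) while roughly 21.9-22.0 GiB of the 22.07 GiB device is already in use, which points at memory accumulating across hypothesis examples rather than any single example being too large on its own. The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; here is a short sketch of that setting plus an explicit cache release between examples, offered as possible mitigations under stated assumptions, not as fixes applied in this run:

    import gc
    import os

    # The caching allocator reads this configuration when CUDA memory is
    # first allocated, so it must be set before any CUDA work in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references, then return cached allocator blocks
        # to the driver so the next example starts from a cleaner pool.
        gc.collect()
        torch.cuda.empty_cache()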
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3647877Z 2025-05-07T20:32:51.3648088Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.3648433Z 2025-05-07T20:32:51.3648613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3649291Z self=, 2025-05-07T20:32:51.3649943Z T=1, 2025-05-07T20:32:51.3650192Z D=7168, 2025-05-07T20:32:51.3650444Z scale_ub=1200.0, 2025-05-07T20:32:51.3650757Z contiguous=True, 2025-05-07T20:32:51.3651083Z compiled=False, 2025-05-07T20:32:51.3651391Z ) 2025-05-07T20:32:51.3652367Z self = 2025-05-07T20:32:51.3653202Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.3653656Z 2025-05-07T20:32:51.3653778Z @given( 2025-05-07T20:32:51.3654138Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3655000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3655505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3656007Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3656541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3657019Z ) 2025-05-07T20:32:51.3657630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3658386Z def test_silu_mul_quant( 2025-05-07T20:32:51.3658805Z self, 2025-05-07T20:32:51.3659125Z T: int, 2025-05-07T20:32:51.3659455Z D: int, 2025-05-07T20:32:51.3659820Z scale_ub: Optional[float], 2025-05-07T20:32:51.3660279Z contiguous: bool, 2025-05-07T20:32:51.3660698Z compiled: bool, 2025-05-07T20:32:51.3661082Z ) -> None: 2025-05-07T20:32:51.3661441Z torch.manual_seed(2025) 2025-05-07T20:32:51.3661870Z 2025-05-07T20:32:51.3662333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3662923Z 2025-05-07T20:32:51.3663230Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3663746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3664230Z x = x_sign * x_clamp 2025-05-07T20:32:51.3664631Z x0 = x[:, :D] 2025-05-07T20:32:51.3665000Z x1 = x[:, D:] 2025-05-07T20:32:51.3665336Z 2025-05-07T20:32:51.3665628Z if contiguous: 2025-05-07T20:32:51.3665995Z x0 = x0.contiguous() 2025-05-07T20:32:51.3666406Z x1 = x1.contiguous() 2025-05-07T20:32:51.3666783Z 2025-05-07T20:32:51.3667096Z if scale_ub is not None: 2025-05-07T20:32:51.3667541Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3668057Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3668544Z ) 2025-05-07T20:32:51.3668863Z else: 2025-05-07T20:32:51.3669197Z scale_ub_tensor = None 2025-05-07T20:32:51.3669624Z 2025-05-07T20:32:51.3670008Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3670533Z op = silu_mul_quant 2025-05-07T20:32:51.3670936Z if compiled: 2025-05-07T20:32:51.3671338Z op = torch.compile(op) 2025-05-07T20:32:51.3671823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3672275Z 2025-05-07T20:32:51.3672594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3672851Z 2025-05-07T20:32:51.3673019Z moe/activation_test.py:117: 2025-05-07T20:32:51.3673494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3674078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3674569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3675657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3676824Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3677816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3679073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3680263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3681246Z kernel = self.compile( 2025-05-07T20:32:51.3682237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3683440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3684293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3684730Z 2025-05-07T20:32:51.3685089Z self = 2025-05-07T20:32:51.3687085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3689570Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba10f40>} 2025-05-07T20:32:51.3691998Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3693952Z context = 2025-05-07T20:32:51.3694605Z 2025-05-07T20:32:51.3694896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3695838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3696672Z module_map=module_map) 2025-05-07T20:32:51.3697309Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3697933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3698386Z E ^ 2025-05-07T20:32:51.3699235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3700085Z 2025-05-07T20:32:51.3700857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3701820Z 2025-05-07T20:32:51.3702010Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3702751Z self=, 2025-05-07T20:32:51.3703473Z T=128, 2025-05-07T20:32:51.3703797Z D=5120, 2025-05-07T20:32:51.3704118Z scale_ub=None, 2025-05-07T20:32:51.3704486Z contiguous=True, 2025-05-07T20:32:51.3704869Z compiled=False, 2025-05-07T20:32:51.3705226Z ) 2025-05-07T20:32:51.4384163Z self = 2025-05-07T20:32:51.4385072Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4385530Z 2025-05-07T20:32:51.4385656Z @given( 2025-05-07T20:32:51.4386029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4386542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4387037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4387579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4388142Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4388626Z ) 2025-05-07T20:32:51.4389240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4389985Z def test_silu_mul_quant( 2025-05-07T20:32:51.4390374Z self, 2025-05-07T20:32:51.4390694Z T: int, 2025-05-07T20:32:51.4391033Z D: int, 2025-05-07T20:32:51.4391395Z scale_ub: Optional[float], 2025-05-07T20:32:51.4391834Z contiguous: bool, 2025-05-07T20:32:51.4392210Z compiled: bool, 2025-05-07T20:32:51.4392530Z ) -> None: 2025-05-07T20:32:51.4392832Z torch.manual_seed(2025) 2025-05-07T20:32:51.4393200Z 2025-05-07T20:32:51.4393627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4394179Z 2025-05-07T20:32:51.4394492Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4394956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4395454Z x = x_sign * x_clamp 2025-05-07T20:32:51.4395848Z x0 = x[:, :D] 2025-05-07T20:32:51.4396595Z x1 = x[:, D:] 2025-05-07T20:32:51.4396949Z 2025-05-07T20:32:51.4397243Z if contiguous: 2025-05-07T20:32:51.4397644Z x0 = x0.contiguous() 2025-05-07T20:32:51.4398080Z x1 = x1.contiguous() 2025-05-07T20:32:51.4398494Z 2025-05-07T20:32:51.4399074Z if scale_ub is not None: 2025-05-07T20:32:51.4399550Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4400123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4400667Z ) 2025-05-07T20:32:51.4400989Z else: 2025-05-07T20:32:51.4401335Z scale_ub_tensor = None 2025-05-07T20:32:51.4401769Z 2025-05-07T20:32:51.4402160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4402699Z op = silu_mul_quant 2025-05-07T20:32:51.4403131Z if compiled: 2025-05-07T20:32:51.4403544Z op = torch.compile(op) 2025-05-07T20:32:51.4404079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4404536Z 2025-05-07T20:32:51.4404843Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4405106Z 2025-05-07T20:32:51.4405272Z moe/activation_test.py:117: 2025-05-07T20:32:51.4405781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4406345Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4406811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4407960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4409138Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4410055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4411220Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4412374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4413313Z kernel = self.compile( 2025-05-07T20:32:51.4414271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4415575Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4416292Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4416707Z 2025-05-07T20:32:51.4417072Z self = 2025-05-07T20:32:51.4419040Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4421591Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba12020>} 2025-05-07T20:32:51.4424155Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4426434Z context = 2025-05-07T20:32:51.4426951Z 2025-05-07T20:32:51.4427230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4428101Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4428901Z module_map=module_map) 2025-05-07T20:32:51.4429497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4430093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4430544Z E ^ 2025-05-07T20:32:51.4431380Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4432424Z 2025-05-07T20:32:51.4433213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4434176Z 2025-05-07T20:32:51.4434355Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4435248Z self=, 2025-05-07T20:32:51.4435964Z T=128, 2025-05-07T20:32:51.4436282Z D=7168, 2025-05-07T20:32:51.4436595Z scale_ub=None, 2025-05-07T20:32:51.4436953Z contiguous=True, 2025-05-07T20:32:51.4437331Z compiled=False, 2025-05-07T20:32:51.4437677Z ) 2025-05-07T20:32:51.4438236Z self = 2025-05-07T20:32:51.4439117Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4439602Z 2025-05-07T20:32:51.4439733Z @given( 2025-05-07T20:32:51.4440121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4440682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4441228Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4441798Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4442380Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4442892Z ) 2025-05-07T20:32:51.4443502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4444345Z def test_silu_mul_quant( 2025-05-07T20:32:51.4444760Z self, 2025-05-07T20:32:51.4445074Z T: int, 2025-05-07T20:32:51.4445402Z D: int, 2025-05-07T20:32:51.4445771Z scale_ub: Optional[float], 2025-05-07T20:32:51.4446230Z contiguous: bool, 2025-05-07T20:32:51.4446635Z compiled: bool, 2025-05-07T20:32:51.4447014Z ) -> None: 2025-05-07T20:32:51.4447363Z torch.manual_seed(2025) 2025-05-07T20:32:51.4447779Z 2025-05-07T20:32:51.4448250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4448853Z 2025-05-07T20:32:51.4449140Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4449537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4449955Z x = x_sign * x_clamp 2025-05-07T20:32:51.4450292Z x0 = x[:, :D] 2025-05-07T20:32:51.4450596Z x1 = x[:, D:] 2025-05-07T20:32:51.4450903Z 2025-05-07T20:32:51.4451168Z if contiguous: 2025-05-07T20:32:51.4451503Z x0 = x0.contiguous() 2025-05-07T20:32:51.4451891Z x1 = x1.contiguous() 2025-05-07T20:32:51.4452213Z 2025-05-07T20:32:51.4452489Z if scale_ub is not None: 2025-05-07T20:32:51.4452895Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4453400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4453956Z ) 2025-05-07T20:32:51.4454270Z else: 2025-05-07T20:32:51.4454756Z scale_ub_tensor = None 2025-05-07T20:32:51.4455155Z 2025-05-07T20:32:51.4455543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4456090Z op = silu_mul_quant 2025-05-07T20:32:51.4456494Z if compiled: 2025-05-07T20:32:51.4456885Z op = torch.compile(op) 2025-05-07T20:32:51.4457392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4457851Z 2025-05-07T20:32:51.4458174Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4458456Z 2025-05-07T20:32:51.4458620Z moe/activation_test.py:117: 2025-05-07T20:32:51.4459117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4459709Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4460181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4461412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4462609Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4463627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4464750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4465760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4466762Z kernel = self.compile( 2025-05-07T20:32:51.4467636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4468726Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4469351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4469721Z 2025-05-07T20:32:51.4470022Z self = 2025-05-07T20:32:51.4471821Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4474193Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba12f20>} 2025-05-07T20:32:51.4476439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4478136Z context = 2025-05-07T20:32:51.4478611Z 2025-05-07T20:32:51.4478873Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4479709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4480456Z module_map=module_map) 2025-05-07T20:32:51.4481037Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4481604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4482008Z E ^ 2025-05-07T20:32:51.4482764Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4483535Z 2025-05-07T20:32:51.4484297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4485254Z 2025-05-07T20:32:51.4485439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4486166Z self=, 2025-05-07T20:32:51.4486868Z T=2048, 2025-05-07T20:32:51.4487187Z D=7168, 2025-05-07T20:32:51.4487500Z scale_ub=1200.0, 2025-05-07T20:32:51.4487876Z contiguous=True, 2025-05-07T20:32:51.4488241Z compiled=False, 2025-05-07T20:32:51.4488588Z ) 2025-05-07T20:32:51.5299673Z self = 2025-05-07T20:32:51.5300584Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.5301049Z 2025-05-07T20:32:51.5301174Z @given( 2025-05-07T20:32:51.5301558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.5302071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.5302560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.5303105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.5303664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.5304130Z ) 2025-05-07T20:32:51.5304718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.5305461Z def test_silu_mul_quant( 2025-05-07T20:32:51.5305849Z self, 2025-05-07T20:32:51.5306150Z T: int, 2025-05-07T20:32:51.5306469Z D: int, 2025-05-07T20:32:51.5307288Z scale_ub: Optional[float], 2025-05-07T20:32:51.5307719Z contiguous: bool, 2025-05-07T20:32:51.5308046Z compiled: bool, 2025-05-07T20:32:51.5308364Z ) -> None: 2025-05-07T20:32:51.5308657Z torch.manual_seed(2025) 2025-05-07T20:32:51.5309010Z 2025-05-07T20:32:51.5309699Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.5313262Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
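Note on the CompilationError repeated above: Triton rejects the fp8e4nv (FP8 E4M3) element type while compiling _fbgemm_silu_mul_quant because the GPU in this job predates hardware FP8 support; fp8e4nv appears to require compute capability 8.9 or newer (Ada/Hopper), which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a capability guard a test could use to skip these cases on older parts — the (8, 9) threshold, helper name, and skip message are assumptions, not code from this repository:

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: FP8 E4M3 ("fp8e4nv" in Triton) needs SM 8.9+ (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on a test case:
    @unittest.skipUnless(supports_fp8_e4m3(), "fp8e4nv unsupported on this GPU")
    class FP8ActivationTests(unittest.TestCase):
        ...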
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.5316852Z 2025-05-07T20:32:51.5317082Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.5317468Z 2025-05-07T20:32:51.5317638Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.5318373Z self=, 2025-05-07T20:32:51.5319094Z T=1, 2025-05-07T20:32:51.5319388Z D=5120, 2025-05-07T20:32:51.5319710Z scale_ub=1200.0, 2025-05-07T20:32:51.5320068Z contiguous=True, 2025-05-07T20:32:51.5320406Z compiled=False, 2025-05-07T20:32:51.5320763Z ) 2025-05-07T20:32:51.5321315Z self = 2025-05-07T20:32:51.5322110Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.5322541Z 2025-05-07T20:32:51.5322658Z @given( 2025-05-07T20:32:51.5323019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.5323542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.5324072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.5324631Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.5325165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.5325978Z ) 2025-05-07T20:32:51.5326558Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.5327312Z def test_silu_mul_quant( 2025-05-07T20:32:51.5327699Z self, 2025-05-07T20:32:51.5328011Z T: int, 2025-05-07T20:32:51.5328320Z D: int, 2025-05-07T20:32:51.5328665Z scale_ub: Optional[float], 2025-05-07T20:32:51.5329126Z contiguous: bool, 2025-05-07T20:32:51.5329536Z compiled: bool, 2025-05-07T20:32:51.5329898Z ) -> None: 2025-05-07T20:32:51.5330263Z torch.manual_seed(2025) 2025-05-07T20:32:51.5330650Z 2025-05-07T20:32:51.5331076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.5331676Z 2025-05-07T20:32:51.5332013Z x_sign = torch.sign(x) 2025-05-07T20:32:51.5332524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.5333062Z x = x_sign * x_clamp 2025-05-07T20:32:51.5333465Z x0 = x[:, :D] 2025-05-07T20:32:51.5333879Z x1 = x[:, D:] 2025-05-07T20:32:51.5334224Z 2025-05-07T20:32:51.5334647Z if contiguous: 2025-05-07T20:32:51.5348099Z x0 = x0.contiguous() 2025-05-07T20:32:51.5348601Z x1 = x1.contiguous() 2025-05-07T20:32:51.5349035Z 2025-05-07T20:32:51.5349369Z if scale_ub is not None: 2025-05-07T20:32:51.5349844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.5350433Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.5350976Z ) 2025-05-07T20:32:51.5351294Z else: 2025-05-07T20:32:51.5351651Z scale_ub_tensor = None 2025-05-07T20:32:51.5352098Z 2025-05-07T20:32:51.5352485Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.5353039Z op = silu_mul_quant 2025-05-07T20:32:51.5353691Z if compiled: 2025-05-07T20:32:51.5354121Z op = torch.compile(op) 2025-05-07T20:32:51.5354637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.5355117Z 2025-05-07T20:32:51.5355434Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.5355899Z 2025-05-07T20:32:51.5356072Z moe/activation_test.py:117: 2025-05-07T20:32:51.5356588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.5357175Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.5357655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.5358920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.5360198Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.5361167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.5362438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.5363658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.5364478Z kernel = self.compile( 2025-05-07T20:32:51.5365310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.5366344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.5366953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.5367299Z 2025-05-07T20:32:51.5367592Z self = 2025-05-07T20:32:51.5369186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.5371247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121b9384a0>} 2025-05-07T20:32:51.5373254Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.5374906Z context = 2025-05-07T20:32:51.5375333Z 2025-05-07T20:32:51.5375571Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.5376329Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.5377017Z module_map=module_map) 2025-05-07T20:32:51.5377531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.5378031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.5378392Z E ^ 2025-05-07T20:32:51.5379072Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.5379747Z 2025-05-07T20:32:51.5380367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.5381142Z 2025-05-07T20:32:51.5381290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.5381885Z self=, 2025-05-07T20:32:51.5382468Z T=2048, 2025-05-07T20:32:51.5382726Z D=5120, 2025-05-07T20:32:51.5383012Z scale_ub=None, 2025-05-07T20:32:51.5383323Z contiguous=True, 2025-05-07T20:32:51.5383659Z compiled=False, 2025-05-07T20:32:51.5383976Z ) 2025-05-07T20:32:51.5384428Z self = 2025-05-07T20:32:51.5385271Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.5385682Z 2025-05-07T20:32:51.5385794Z @given( 2025-05-07T20:32:51.5386114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.5386563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.5387082Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.5387572Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.5388059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.5388464Z ) 2025-05-07T20:32:51.5388991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.5389688Z def test_silu_mul_quant( 2025-05-07T20:32:51.5390047Z self, 2025-05-07T20:32:51.5390365Z T: int, 2025-05-07T20:32:51.5390685Z D: int, 2025-05-07T20:32:51.5391036Z scale_ub: Optional[float], 2025-05-07T20:32:51.5391512Z contiguous: bool, 2025-05-07T20:32:51.5391920Z compiled: bool, 2025-05-07T20:32:51.5392291Z ) -> None: 2025-05-07T20:32:51.5392636Z torch.manual_seed(2025) 2025-05-07T20:32:51.5393036Z 2025-05-07T20:32:51.5393475Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.5394088Z 2025-05-07T20:32:51.5394406Z > x_sign = torch.sign(x) 2025-05-07T20:32:51.5397796Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.5401035Z 2025-05-07T20:32:51.5401260Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.5401624Z 2025-05-07T20:32:51.5401791Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.5402476Z self=, 2025-05-07T20:32:51.5403157Z T=16384, 2025-05-07T20:32:51.5403487Z D=5120, 2025-05-07T20:32:51.5403781Z scale_ub=None, 2025-05-07T20:32:51.5404139Z contiguous=True, 2025-05-07T20:32:51.5404495Z compiled=False, 2025-05-07T20:32:51.5404810Z ) 2025-05-07T20:32:51.6157216Z self = 2025-05-07T20:32:51.6158094Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.6158571Z 2025-05-07T20:32:51.6158705Z @given( 2025-05-07T20:32:51.6159088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6159605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6160121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6160699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6161232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6161714Z ) 2025-05-07T20:32:51.6162306Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6163067Z def test_silu_mul_quant( 2025-05-07T20:32:51.6163474Z self, 2025-05-07T20:32:51.6163847Z T: int, 2025-05-07T20:32:51.6164161Z D: int, 2025-05-07T20:32:51.6164518Z scale_ub: Optional[float], 2025-05-07T20:32:51.6164960Z contiguous: bool, 2025-05-07T20:32:51.6165347Z compiled: bool, 2025-05-07T20:32:51.6165703Z ) -> None: 2025-05-07T20:32:51.6166026Z torch.manual_seed(2025) 2025-05-07T20:32:51.6166395Z 2025-05-07T20:32:51.6166792Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6170651Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
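From here on the examples fail before the kernel is reached: each Hypothesis example allocates fresh CUDA tensors while earlier allocations are still held, so the process sits at ~22 GiB with only tens of MiB free. The error text itself suggests expandable_segments; a sketch of that plus an explicit cache flush between examples — where the cleanup hook lives (tearDown, or after each example) is an assumption, not taken from this test file:

    import os
    # Assumption: set before CUDA is initialized, i.e. at process start.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dangling references, then return cached blocks to the allocator.
        gc.collect()
        torch.cuda.empty_cache()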
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6174509Z 2025-05-07T20:32:51.6174724Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6175101Z 2025-05-07T20:32:51.6175283Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6175999Z self=, 2025-05-07T20:32:51.6176723Z T=4096, 2025-05-07T20:32:51.6177035Z D=5120, 2025-05-07T20:32:51.6177352Z scale_ub=None, 2025-05-07T20:32:51.6177704Z contiguous=True, 2025-05-07T20:32:51.6178055Z compiled=False, 2025-05-07T20:32:51.6178373Z ) 2025-05-07T20:32:51.6178926Z self = 2025-05-07T20:32:51.6179773Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.6180220Z 2025-05-07T20:32:51.6180371Z @given( 2025-05-07T20:32:51.6180733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6181235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6181731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6182255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6182797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6183280Z ) 2025-05-07T20:32:51.6183861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6184600Z def test_silu_mul_quant( 2025-05-07T20:32:51.6185002Z self, 2025-05-07T20:32:51.6185330Z T: int, 2025-05-07T20:32:51.6185640Z D: int, 2025-05-07T20:32:51.6186000Z scale_ub: Optional[float], 2025-05-07T20:32:51.6186472Z contiguous: bool, 2025-05-07T20:32:51.6186877Z compiled: bool, 2025-05-07T20:32:51.6187269Z ) -> None: 2025-05-07T20:32:51.6187636Z torch.manual_seed(2025) 2025-05-07T20:32:51.6188037Z 2025-05-07T20:32:51.6188477Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6192299Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6195879Z 2025-05-07T20:32:51.6196079Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6196459Z 2025-05-07T20:32:51.6196642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6197370Z self=, 2025-05-07T20:32:51.6198095Z T=2048, 2025-05-07T20:32:51.6198414Z D=5120, 2025-05-07T20:32:51.6198773Z scale_ub=None, 2025-05-07T20:32:51.6199128Z contiguous=False, 2025-05-07T20:32:51.6199509Z compiled=False, 2025-05-07T20:32:51.6199860Z ) 2025-05-07T20:32:51.6200391Z self = 2025-05-07T20:32:51.6201218Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.6201684Z 2025-05-07T20:32:51.6201817Z @given( 2025-05-07T20:32:51.6202188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6202702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6203369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6204011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6204592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6205095Z ) 2025-05-07T20:32:51.6205861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6206652Z def test_silu_mul_quant( 2025-05-07T20:32:51.6207077Z self, 2025-05-07T20:32:51.6207401Z T: int, 2025-05-07T20:32:51.6207720Z D: int, 2025-05-07T20:32:51.6208093Z scale_ub: Optional[float], 2025-05-07T20:32:51.6208578Z contiguous: bool, 2025-05-07T20:32:51.6208974Z compiled: bool, 2025-05-07T20:32:51.6209355Z ) -> None: 2025-05-07T20:32:51.6209724Z torch.manual_seed(2025) 2025-05-07T20:32:51.6210149Z 2025-05-07T20:32:51.6210609Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6214635Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6218180Z 2025-05-07T20:32:51.6218380Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6218755Z 2025-05-07T20:32:51.6218943Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6219663Z self=, 2025-05-07T20:32:51.6220380Z T=4096, 2025-05-07T20:32:51.6220714Z D=7168, 2025-05-07T20:32:51.6221023Z scale_ub=None, 2025-05-07T20:32:51.6221382Z contiguous=True, 2025-05-07T20:32:51.6221763Z compiled=True, 2025-05-07T20:32:51.6222093Z ) 2025-05-07T20:32:51.6222565Z self = 2025-05-07T20:32:51.6223281Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.6223701Z 2025-05-07T20:32:51.6223816Z @given( 2025-05-07T20:32:51.6224152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6224624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6225076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6225910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6226382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6226796Z ) 2025-05-07T20:32:51.6227295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6227940Z def test_silu_mul_quant( 2025-05-07T20:32:51.6228291Z self, 2025-05-07T20:32:51.6228575Z T: int, 2025-05-07T20:32:51.6228858Z D: int, 2025-05-07T20:32:51.6229166Z scale_ub: Optional[float], 2025-05-07T20:32:51.6229542Z contiguous: bool, 2025-05-07T20:32:51.6229883Z compiled: bool, 2025-05-07T20:32:51.6230214Z ) -> None: 2025-05-07T20:32:51.6230524Z torch.manual_seed(2025) 2025-05-07T20:32:51.6230867Z 2025-05-07T20:32:51.6231249Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6234575Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
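For reference, the strategies in the @given block above draw from fixed grids, so the example space is small: 5 values of T, 2 of D, 2 of scale_ub, 2 of contiguous, 2 of compiled, giving 80 distinct combinations from which Hypothesis samples up to _MAX_SAMPLES. A sketch of the equivalent exhaustive grid, independent of Hypothesis:

    import itertools

    # Same parameter grid the test samples from: 5 * 2 * 2 * 2 * 2 = 80 cases.
    grid = list(itertools.product(
        [1, 128, 2048, 4096, 16384],   # T
        [5120, 7168],                  # D
        [None, 1200.00],               # scale_ub
        [True, False],                 # contiguous
        [True, False],                 # compiled
    ))
    assert len(grid) == 80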
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6237401Z 2025-05-07T20:32:51.6237575Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6237879Z 2025-05-07T20:32:51.6238029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6238773Z self=, 2025-05-07T20:32:51.6239356Z T=2048, 2025-05-07T20:32:51.6239609Z D=5120, 2025-05-07T20:32:51.6239873Z scale_ub=1200.0, 2025-05-07T20:32:51.6240182Z contiguous=False, 2025-05-07T20:32:51.6240485Z compiled=False, 2025-05-07T20:32:51.6240774Z ) 2025-05-07T20:32:51.6241226Z self = 2025-05-07T20:32:51.6241935Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.6242347Z 2025-05-07T20:32:51.6242457Z @given( 2025-05-07T20:32:51.6242774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6243232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6243660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6244132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6244599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6245014Z ) 2025-05-07T20:32:51.6245510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6246151Z def test_silu_mul_quant( 2025-05-07T20:32:51.6246479Z self, 2025-05-07T20:32:51.6246748Z T: int, 2025-05-07T20:32:51.6247023Z D: int, 2025-05-07T20:32:51.6247322Z scale_ub: Optional[float], 2025-05-07T20:32:51.6247701Z contiguous: bool, 2025-05-07T20:32:51.6248056Z compiled: bool, 2025-05-07T20:32:51.6248371Z ) -> None: 2025-05-07T20:32:51.6248668Z torch.manual_seed(2025) 2025-05-07T20:32:51.6249003Z 2025-05-07T20:32:51.6249410Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6253025Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6256444Z 2025-05-07T20:32:51.6256640Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6256997Z 2025-05-07T20:32:51.6257161Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6257843Z self=, 2025-05-07T20:32:51.6258504Z T=4096, 2025-05-07T20:32:51.6258795Z D=7168, 2025-05-07T20:32:51.6259117Z scale_ub=1200.0, 2025-05-07T20:32:51.6259474Z contiguous=True, 2025-05-07T20:32:51.6259822Z compiled=False, 2025-05-07T20:32:51.6260150Z ) 2025-05-07T20:32:51.7324137Z self = 2025-05-07T20:32:51.7325091Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.7325846Z 2025-05-07T20:32:51.7325989Z @given( 2025-05-07T20:32:51.7326355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7326867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7327364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7327890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7328423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7328858Z ) 2025-05-07T20:32:51.7329427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7330617Z def test_silu_mul_quant( 2025-05-07T20:32:51.7331014Z self, 2025-05-07T20:32:51.7331326Z T: int, 2025-05-07T20:32:51.7331619Z D: int, 2025-05-07T20:32:51.7331959Z scale_ub: Optional[float], 2025-05-07T20:32:51.7332407Z contiguous: bool, 2025-05-07T20:32:51.7333036Z compiled: bool, 2025-05-07T20:32:51.7333410Z ) -> None: 2025-05-07T20:32:51.7333750Z torch.manual_seed(2025) 2025-05-07T20:32:51.7334131Z 2025-05-07T20:32:51.7334725Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7338325Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7341591Z 2025-05-07T20:32:51.7341781Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7342152Z 2025-05-07T20:32:51.7342320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7342988Z self=, 2025-05-07T20:32:51.7343701Z T=16384, 2025-05-07T20:32:51.7343982Z D=7168, 2025-05-07T20:32:51.7344218Z scale_ub=None, 2025-05-07T20:32:51.7344496Z contiguous=False, 2025-05-07T20:32:51.7344789Z compiled=True, 2025-05-07T20:32:51.7345066Z ) 2025-05-07T20:32:51.7345535Z self = 2025-05-07T20:32:51.7346295Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.7346745Z 2025-05-07T20:32:51.7346883Z @given( 2025-05-07T20:32:51.7347225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7347703Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7348202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7348729Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7349287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7349766Z ) 2025-05-07T20:32:51.7350341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7351110Z def test_silu_mul_quant( 2025-05-07T20:32:51.7351515Z self, 2025-05-07T20:32:51.7351821Z T: int, 2025-05-07T20:32:51.7352152Z D: int, 2025-05-07T20:32:51.7352510Z scale_ub: Optional[float], 2025-05-07T20:32:51.7352955Z contiguous: bool, 2025-05-07T20:32:51.7353363Z compiled: bool, 2025-05-07T20:32:51.7353733Z ) -> None: 2025-05-07T20:32:51.7354079Z torch.manual_seed(2025) 2025-05-07T20:32:51.7354468Z 2025-05-07T20:32:51.7354933Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7358561Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7361957Z 2025-05-07T20:32:51.7362166Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7362525Z 2025-05-07T20:32:51.7362694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7363406Z self=, 2025-05-07T20:32:51.7364256Z T=4096, 2025-05-07T20:32:51.7364583Z D=7168, 2025-05-07T20:32:51.7364889Z scale_ub=None, 2025-05-07T20:32:51.7365250Z contiguous=True, 2025-05-07T20:32:51.7365631Z compiled=False, 2025-05-07T20:32:51.7365964Z ) 2025-05-07T20:32:51.7366614Z self = 2025-05-07T20:32:51.7367474Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.7367946Z 2025-05-07T20:32:51.7368069Z @given( 2025-05-07T20:32:51.7368448Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7368978Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7369482Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7370050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7370613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7371106Z ) 2025-05-07T20:32:51.7371708Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7372487Z def test_silu_mul_quant( 2025-05-07T20:32:51.7372874Z self, 2025-05-07T20:32:51.7373145Z T: int, 2025-05-07T20:32:51.7373437Z D: int, 2025-05-07T20:32:51.7373766Z scale_ub: Optional[float], 2025-05-07T20:32:51.7374211Z contiguous: bool, 2025-05-07T20:32:51.7374709Z compiled: bool, 2025-05-07T20:32:51.7375074Z ) -> None: 2025-05-07T20:32:51.7375412Z torch.manual_seed(2025) 2025-05-07T20:32:51.7375816Z 2025-05-07T20:32:51.7376263Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7379910Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7383279Z 2025-05-07T20:32:51.7383501Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7383858Z 2025-05-07T20:32:51.7384030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7384726Z self=, 2025-05-07T20:32:51.7385416Z T=16384, 2025-05-07T20:32:51.7385724Z D=7168, 2025-05-07T20:32:51.7386042Z scale_ub=None, 2025-05-07T20:32:51.7386387Z contiguous=True, 2025-05-07T20:32:51.7386750Z compiled=False, 2025-05-07T20:32:51.7387086Z ) 2025-05-07T20:32:51.7387624Z self = 2025-05-07T20:32:51.7388477Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.7388965Z 2025-05-07T20:32:51.7389105Z @given( 2025-05-07T20:32:51.7389478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7390006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7390515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7391081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7391637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7392110Z ) 2025-05-07T20:32:51.7392701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7393463Z def test_silu_mul_quant( 2025-05-07T20:32:51.7393885Z self, 2025-05-07T20:32:51.7394162Z T: int, 2025-05-07T20:32:51.7394464Z D: int, 2025-05-07T20:32:51.7394782Z scale_ub: Optional[float], 2025-05-07T20:32:51.7395173Z contiguous: bool, 2025-05-07T20:32:51.7395547Z compiled: bool, 2025-05-07T20:32:51.7395851Z ) -> None: 2025-05-07T20:32:51.7396331Z torch.manual_seed(2025) 2025-05-07T20:32:51.7396705Z 2025-05-07T20:32:51.7397150Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7400582Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7404153Z 2025-05-07T20:32:51.7404373Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7404740Z 2025-05-07T20:32:51.7404926Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7405640Z self=, 2025-05-07T20:32:51.7406349Z T=16384, 2025-05-07T20:32:51.7406669Z D=7168, 2025-05-07T20:32:51.7406977Z scale_ub=1200.0, 2025-05-07T20:32:51.7407352Z contiguous=True, 2025-05-07T20:32:51.7407741Z compiled=False, 2025-05-07T20:32:51.7421534Z ) 2025-05-07T20:32:51.7422098Z self = 2025-05-07T20:32:51.7422945Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.7423436Z 2025-05-07T20:32:51.7423566Z @given( 2025-05-07T20:32:51.7423961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7424486Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7425016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7425921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7426492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7426975Z ) 2025-05-07T20:32:51.7427582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7428355Z def test_silu_mul_quant( 2025-05-07T20:32:51.7428752Z self, 2025-05-07T20:32:51.7429073Z T: int, 2025-05-07T20:32:51.7429413Z D: int, 2025-05-07T20:32:51.7429765Z scale_ub: Optional[float], 2025-05-07T20:32:51.7430233Z contiguous: bool, 2025-05-07T20:32:51.7430632Z compiled: bool, 2025-05-07T20:32:51.7430996Z ) -> None: 2025-05-07T20:32:51.7431358Z torch.manual_seed(2025) 2025-05-07T20:32:51.7431765Z 2025-05-07T20:32:51.7432213Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7435953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7439348Z 2025-05-07T20:32:51.7439556Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7439932Z 2025-05-07T20:32:51.7440105Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7440814Z self=, 2025-05-07T20:32:51.7441502Z T=128, 2025-05-07T20:32:51.7441810Z D=5120, 2025-05-07T20:32:51.7442138Z scale_ub=1200.0, 2025-05-07T20:32:51.7442506Z contiguous=False, 2025-05-07T20:32:51.7442892Z compiled=False, 2025-05-07T20:32:51.7443236Z ) 2025-05-07T20:32:51.8694265Z self = 2025-05-07T20:32:51.8695392Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.8695701Z 2025-05-07T20:32:51.8695791Z @given( 2025-05-07T20:32:51.8696038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.8696370Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.8696864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.8697214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.8697562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.8697857Z ) 2025-05-07T20:32:51.8698215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.8698674Z def test_silu_mul_quant( 2025-05-07T20:32:51.8698917Z self, 2025-05-07T20:32:51.8699133Z T: int, 2025-05-07T20:32:51.8699344Z D: int, 2025-05-07T20:32:51.8699576Z scale_ub: Optional[float], 2025-05-07T20:32:51.8699863Z contiguous: bool, 2025-05-07T20:32:51.8700125Z compiled: bool, 2025-05-07T20:32:51.8700370Z ) -> None: 2025-05-07T20:32:51.8700584Z torch.manual_seed(2025) 2025-05-07T20:32:51.8700837Z 2025-05-07T20:32:51.8701125Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.8701484Z 2025-05-07T20:32:51.8701692Z x_sign = torch.sign(x) 2025-05-07T20:32:51.8701998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.8702317Z x = x_sign * x_clamp 2025-05-07T20:32:51.8702569Z x0 = x[:, :D] 2025-05-07T20:32:51.8702800Z x1 = x[:, D:] 2025-05-07T20:32:51.8703018Z 2025-05-07T20:32:51.8703209Z if contiguous: 2025-05-07T20:32:51.8703453Z x0 = x0.contiguous() 2025-05-07T20:32:51.8703710Z x1 = x1.contiguous() 2025-05-07T20:32:51.8703965Z 2025-05-07T20:32:51.8704174Z if scale_ub is not None: 2025-05-07T20:32:51.8704449Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.8704802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.8705133Z ) 2025-05-07T20:32:51.8705337Z else: 2025-05-07T20:32:51.8705555Z scale_ub_tensor = None 2025-05-07T20:32:51.8705821Z 2025-05-07T20:32:51.8706056Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.8706383Z op = silu_mul_quant 2025-05-07T20:32:51.8706647Z if compiled: 2025-05-07T20:32:51.8706903Z op = torch.compile(op) 2025-05-07T20:32:51.8707213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8707504Z 2025-05-07T20:32:51.8707703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.8707870Z 2025-05-07T20:32:51.8707975Z moe/activation_test.py:117: 2025-05-07T20:32:51.8708280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8708630Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.8708920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8709649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.8710382Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.8710949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.8711666Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.8712365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.8712928Z kernel = self.compile( 2025-05-07T20:32:51.8713496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.8714180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.8714594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8714831Z 2025-05-07T20:32:51.8715140Z self = 2025-05-07T20:32:51.8716270Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.8717798Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121b943060>} 2025-05-07T20:32:51.8719216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.8720323Z context = 2025-05-07T20:32:51.8720631Z 2025-05-07T20:32:51.8720822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.8721371Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.8721871Z module_map=module_map) 2025-05-07T20:32:51.8722260Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.8722636Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.8722912Z E ^ 2025-05-07T20:32:51.8723402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.8723876Z 2025-05-07T20:32:51.8724325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.8724871Z 2025-05-07T20:32:51.8724980Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8725645Z self=, 2025-05-07T20:32:51.8726206Z T=2048, 2025-05-07T20:32:51.8726463Z D=7168, 2025-05-07T20:32:51.8726706Z scale_ub=None, 2025-05-07T20:32:51.8726999Z contiguous=False, 2025-05-07T20:32:51.8727288Z compiled=False, 2025-05-07T20:32:51.8727506Z ) 2025-05-07T20:32:51.8727849Z self = 2025-05-07T20:32:51.8728380Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.8728669Z 2025-05-07T20:32:51.8728751Z @given( 2025-05-07T20:32:51.8728993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.8729320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.8729636Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.8729978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.8730322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.8730630Z ) 2025-05-07T20:32:51.8730985Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.8731458Z def test_silu_mul_quant( 2025-05-07T20:32:51.8731716Z self, 2025-05-07T20:32:51.8731914Z T: int, 2025-05-07T20:32:51.8732115Z D: int, 2025-05-07T20:32:51.8732339Z scale_ub: Optional[float], 2025-05-07T20:32:51.8732621Z contiguous: bool, 2025-05-07T20:32:51.8732875Z compiled: bool, 2025-05-07T20:32:51.8733104Z ) -> None: 2025-05-07T20:32:51.8733317Z torch.manual_seed(2025) 2025-05-07T20:32:51.8733566Z 2025-05-07T20:32:51.8733846Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.8736371Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.8738375Z 2025-05-07T20:32:51.8738511Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.8738855Z 2025-05-07T20:32:51.8738962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8739403Z self=, 2025-05-07T20:32:51.8739839Z T=128, 2025-05-07T20:32:51.8740031Z D=7168, 2025-05-07T20:32:51.8740237Z scale_ub=1200.0, 2025-05-07T20:32:51.8740468Z contiguous=True, 2025-05-07T20:32:51.8740694Z compiled=True, 2025-05-07T20:32:51.8740912Z ) 2025-05-07T20:32:51.9055624Z self = 2025-05-07T20:32:51.9056235Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.9056550Z 2025-05-07T20:32:51.9056641Z @given( 2025-05-07T20:32:51.9056905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.9057268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.9057621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.9058004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.9058384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.9058710Z ) 2025-05-07T20:32:51.9059114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.9059637Z def test_silu_mul_quant( 2025-05-07T20:32:51.9059906Z self, 2025-05-07T20:32:51.9060113Z T: int, 2025-05-07T20:32:51.9060334Z D: int, 2025-05-07T20:32:51.9060576Z scale_ub: Optional[float], 2025-05-07T20:32:51.9060882Z contiguous: bool, 2025-05-07T20:32:51.9061142Z compiled: bool, 2025-05-07T20:32:51.9061390Z ) -> None: 2025-05-07T20:32:51.9061625Z torch.manual_seed(2025) 2025-05-07T20:32:51.9061893Z 2025-05-07T20:32:51.9062198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.9062591Z 2025-05-07T20:32:51.9062795Z x_sign = torch.sign(x) 2025-05-07T20:32:51.9063118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.9063478Z x = x_sign * x_clamp 2025-05-07T20:32:51.9063735Z x0 = x[:, :D] 2025-05-07T20:32:51.9063974Z x1 = x[:, D:] 2025-05-07T20:32:51.9064204Z 2025-05-07T20:32:51.9064400Z if contiguous: 2025-05-07T20:32:51.9064655Z x0 = x0.contiguous() 2025-05-07T20:32:51.9064944Z x1 = x1.contiguous() 2025-05-07T20:32:51.9065206Z 2025-05-07T20:32:51.9065418Z if scale_ub is not None: 2025-05-07T20:32:51.9065723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.9066102Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.9066445Z ) 2025-05-07T20:32:51.9066659Z else: 2025-05-07T20:32:51.9066899Z scale_ub_tensor = None 2025-05-07T20:32:51.9067182Z 2025-05-07T20:32:51.9067442Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.9067807Z op = silu_mul_quant 2025-05-07T20:32:51.9068082Z if compiled: 2025-05-07T20:32:51.9068353Z op = torch.compile(op) 2025-05-07T20:32:51.9068685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.9068987Z 2025-05-07T20:32:51.9069201Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.9069383Z 2025-05-07T20:32:51.9069503Z moe/activation_test.py:117: 2025-05-07T20:32:51.9069836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.9070210Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.9070524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.9071185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.9072138Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.9072837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.9073572Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.9074306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.9075022Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.9075721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.9076285Z kernel = self.compile( 2025-05-07T20:32:51.9076846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.9077537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.9077967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.9078207Z 2025-05-07T20:32:51.9078427Z self = 2025-05-07T20:32:51.9079555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.9081011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121b7cc900>} 2025-05-07T20:32:51.9082431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.9083521Z context = 2025-05-07T20:32:51.9083822Z 2025-05-07T20:32:51.9084003Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.9084542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.9085035Z module_map=module_map) 2025-05-07T20:32:51.9085416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.9085775Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.9086047Z E ^ 2025-05-07T20:32:51.9086529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.9087002Z 2025-05-07T20:32:51.9087447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.9087996Z 2025-05-07T20:32:51.9088103Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.9088530Z self=, 2025-05-07T20:32:51.9088959Z T=128, 2025-05-07T20:32:51.9089148Z D=7168, 2025-05-07T20:32:51.9089346Z scale_ub=1200.0, 2025-05-07T20:32:51.9089575Z contiguous=True, 2025-05-07T20:32:51.9089795Z compiled=False, 2025-05-07T20:32:51.9090009Z ) 2025-05-07T20:32:51.9090336Z self = 2025-05-07T20:32:51.9090849Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.9091132Z 2025-05-07T20:32:51.9091214Z @given( 2025-05-07T20:32:51.9091475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.9091798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.9092117Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.9092452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.9092797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.9093095Z ) 2025-05-07T20:32:51.9093539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.9094001Z def test_silu_mul_quant( 2025-05-07T20:32:51.9094249Z self, 2025-05-07T20:32:51.9094590Z T: int, 2025-05-07T20:32:51.9094793Z D: int, 2025-05-07T20:32:51.9095103Z scale_ub: Optional[float], 2025-05-07T20:32:51.9095381Z contiguous: bool, 2025-05-07T20:32:51.9095618Z compiled: bool, 2025-05-07T20:32:51.9095844Z ) -> None: 2025-05-07T20:32:51.9096061Z torch.manual_seed(2025) 2025-05-07T20:32:51.9096304Z 2025-05-07T20:32:51.9096580Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.9096942Z 2025-05-07T20:32:51.9097131Z x_sign = torch.sign(x) 2025-05-07T20:32:51.9097427Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.9099574Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
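Where the Triton path cannot compile, an eager reference makes the intended math explicit: silu(x0) * x1, then per-row FP8 quantization. A rough sketch under stated assumptions — row-wise scaling against the E4M3 finite maximum of 448, torch.float8_e4m3fn available (PyTorch 2.1+), and the scale_ub clamp omitted for brevity. This is the shape of the computation, not the FBGEMM implementation:

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

    def silu_mul_quant_ref(x0: torch.Tensor, x1: torch.Tensor):
        # silu(x0) * x1 in fp32, then quantize each row to FP8 E4M3.
        y = (x0.float() * torch.sigmoid(x0.float())) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)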
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.9101572Z 2025-05-07T20:32:51.9101696Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.9101915Z 2025-05-07T20:32:51.9102029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.9102452Z self=, 2025-05-07T20:32:51.9102880Z T=128, 2025-05-07T20:32:51.9103068Z D=5120, 2025-05-07T20:32:51.9103253Z scale_ub=1200.0, 2025-05-07T20:32:51.9103471Z contiguous=True, 2025-05-07T20:32:51.9103692Z compiled=True, 2025-05-07T20:32:51.9103891Z ) 2025-05-07T20:32:51.9104217Z self = 2025-05-07T20:32:51.9104727Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.9105006Z 2025-05-07T20:32:51.9105084Z @given( 2025-05-07T20:32:51.9105316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.9105630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.9105939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.9106266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.9106602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.9106895Z ) 2025-05-07T20:32:51.9107243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.9107701Z def test_silu_mul_quant( 2025-05-07T20:32:51.9107948Z self, 2025-05-07T20:32:51.9108143Z T: int, 2025-05-07T20:32:51.9108347Z D: int, 2025-05-07T20:32:51.9108569Z scale_ub: Optional[float], 2025-05-07T20:32:51.9108837Z contiguous: bool, 2025-05-07T20:32:51.9109079Z compiled: bool, 2025-05-07T20:32:51.9109299Z ) -> None: 2025-05-07T20:32:51.9109506Z torch.manual_seed(2025) 2025-05-07T20:32:51.9109755Z 2025-05-07T20:32:51.9110030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.9110377Z 2025-05-07T20:32:51.9110573Z > x_sign = torch.sign(x) 2025-05-07T20:32:51.9112742Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.9114742Z 2025-05-07T20:32:51.9114865Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.9115085Z 2025-05-07T20:32:51.9115198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.9115702Z self=, 2025-05-07T20:32:51.9116132Z T=128, 2025-05-07T20:32:51.9116330Z D=7168, 2025-05-07T20:32:51.9116528Z scale_ub=None, 2025-05-07T20:32:51.9116752Z contiguous=True, 2025-05-07T20:32:51.9116983Z compiled=True, 2025-05-07T20:32:51.9117190Z ) 2025-05-07T20:32:52.2511361Z self = 2025-05-07T20:32:52.2511926Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.2512223Z 2025-05-07T20:32:52.2512309Z @given( 2025-05-07T20:32:52.2512558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2512909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2513232Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2513582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2513960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2514286Z ) 2025-05-07T20:32:52.2514664Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2515141Z def test_silu_mul_quant( 2025-05-07T20:32:52.2515393Z self, 2025-05-07T20:32:52.2515617Z T: int, 2025-05-07T20:32:52.2515832Z D: int, 2025-05-07T20:32:52.2516059Z scale_ub: Optional[float], 2025-05-07T20:32:52.2516354Z contiguous: bool, 2025-05-07T20:32:52.2516615Z compiled: bool, 2025-05-07T20:32:52.2516846Z ) -> None: 2025-05-07T20:32:52.2517072Z torch.manual_seed(2025) 2025-05-07T20:32:52.2517329Z 2025-05-07T20:32:52.2517610Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2519825Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:52.2521848Z 2025-05-07T20:32:52.2521973Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:52.2522205Z 2025-05-07T20:32:52.2608603Z FAILED 2025-05-07T20:32:52.2608750Z 2025-05-07T20:32:52.2608888Z =================================== FAILURES =================================== 2025-05-07T20:32:52.2609341Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:52.2609806Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:52.2610456Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:52.2611028Z | yield 2025-05-07T20:32:52.2611492Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:52.2612177Z | self._callTestMethod(testMethod) 2025-05-07T20:32:52.2612830Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:52.2613421Z | if method() is not None: 2025-05-07T20:32:52.2613675Z | ^^^^^^^^ 2025-05-07T20:32:52.2614638Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:52.2615522Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2616140Z | ^^^^^^^ 2025-05-07T20:32:52.2616768Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:52.2617436Z | raise the_error_hypothesis_found 2025-05-07T20:32:52.2618021Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:52.2618464Z +-+---------------- 1 ---------------- 2025-05-07T20:32:52.2618777Z | Traceback (most recent call last): 2025-05-07T20:32:52.2619538Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:52.2620433Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2620835Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2623235Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:52.2639944Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:52.2640614Z | self=, 2025-05-07T20:32:52.2641073Z | T=128, 2025-05-07T20:32:52.2641351Z | D=7168, 2025-05-07T20:32:52.2641604Z | scale_ub=1200.0, 2025-05-07T20:32:52.2641932Z | contiguous=True, 2025-05-07T20:32:52.2642179Z | compiled=False, 2025-05-07T20:32:52.2642420Z | ) 2025-05-07T20:32:52.2642622Z | 2025-05-07T20:32:52.2643187Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:52.2643839Z +---------------- 2 ---------------- 2025-05-07T20:32:52.2644158Z | Traceback (most recent call last): 2025-05-07T20:32:52.2644965Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:52.2645840Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2646239Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2648360Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:52.2650476Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:52.2650941Z | self=, 2025-05-07T20:32:52.2651372Z | T=128, 2025-05-07T20:32:52.2651585Z | D=7168, 2025-05-07T20:32:52.2651804Z | scale_ub=None, 2025-05-07T20:32:52.2652046Z | contiguous=True, 2025-05-07T20:32:52.2652298Z | compiled=True, 2025-05-07T20:32:52.2652534Z | ) 2025-05-07T20:32:52.2652717Z | 2025-05-07T20:32:52.2653265Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:52.2653916Z +---------------- 3 ---------------- 2025-05-07T20:32:52.2654517Z | Traceback (most recent call last): 2025-05-07T20:32:52.2655271Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:52.2656211Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2656607Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2658714Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
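Each falsifying example above carries a @reproduce_failure hint. As the Hypothesis message says, temporarily pasting that decorator onto the test replays exactly that example. A hedged sketch, with the version string and payload copied verbatim from the first failure above; the trimmed-down @given signature is illustrative only, since the real test also takes D, scale_ub, contiguous, and compiled:

    # Temporary decorator to replay the first falsifying example locally.
    # Remove it again once the underlying bug is fixed.
    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_silu_mul_quant(T: int) -> None:
        ...  # original test body
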
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:52.2660809Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:52.2661282Z | self=, 2025-05-07T20:32:52.2661715Z | T=128, 2025-05-07T20:32:52.2661936Z | D=5120, 2025-05-07T20:32:52.2662153Z | scale_ub=1200.0, 2025-05-07T20:32:52.2662412Z | contiguous=True, 2025-05-07T20:32:52.2662672Z | compiled=True, 2025-05-07T20:32:52.2662908Z | ) 2025-05-07T20:32:52.2663123Z | 2025-05-07T20:32:52.2663701Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:52.2664603Z +---------------- 4 ---------------- 2025-05-07T20:32:52.2665028Z | Traceback (most recent call last): 2025-05-07T20:32:52.2666084Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:52.2667131Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.2667554Z | ^^^^^^^^ 2025-05-07T20:32:52.2668480Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:52.2669511Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2669986Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2671158Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:52.2672327Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.2673213Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:52.2674288Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2674950Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2675905Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:52.2677034Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.2677712Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2678637Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:52.2679651Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.2680179Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2681134Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:52.2681972Z | fn() 2025-05-07T20:32:52.2682802Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:52.2683699Z | self.fn.run( 2025-05-07T20:32:52.2684264Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:52.2684879Z | kernel = self.compile( 2025-05-07T20:32:52.2685157Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:52.2685775Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:52.2686524Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2686933Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2687620Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:52.2688467Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2688976Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2689377Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2689745Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.2690022Z | ^ 2025-05-07T20:32:52.2690507Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2691101Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:52.2691513Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:52.2692060Z | self=, 2025-05-07T20:32:52.2692515Z | T=1, # or any other generated value 2025-05-07T20:32:52.2692832Z | D=5120, # or any other generated value 2025-05-07T20:32:52.2693186Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:52.2693565Z | contiguous=True, # or any other generated value 2025-05-07T20:32:52.2693975Z | compiled=True, # or any other generated value 2025-05-07T20:32:52.2694300Z | ) 2025-05-07T20:32:52.2694604Z | 2025-05-07T20:32:52.2695158Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:52.2695789Z +------------------------------------ 2025-05-07T20:32:52.2696160Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:52.2696556Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2696983Z self=, 2025-05-07T20:32:52.2697418Z T=1, 2025-05-07T20:32:52.2697618Z D=5120, 2025-05-07T20:32:52.2697815Z scale_ub=None, 2025-05-07T20:32:52.2698043Z contiguous=True, 2025-05-07T20:32:52.2698282Z compiled=True, 2025-05-07T20:32:52.2698503Z ) 2025-05-07T20:32:52.2698840Z self = 2025-05-07T20:32:52.2699348Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.2699623Z 2025-05-07T20:32:52.2699715Z @given( 2025-05-07T20:32:52.2699952Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2700287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2700611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2700952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2701374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2701784Z ) 2025-05-07T20:32:52.2702393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2703047Z def test_silu_mul_quant( 2025-05-07T20:32:52.2703406Z self, 2025-05-07T20:32:52.2703689Z T: int, 2025-05-07T20:32:52.2703949Z D: int, 2025-05-07T20:32:52.2704343Z scale_ub: Optional[float], 2025-05-07T20:32:52.2704724Z contiguous: bool, 2025-05-07T20:32:52.2705055Z compiled: bool, 2025-05-07T20:32:52.2705382Z ) -> None: 2025-05-07T20:32:52.2705703Z torch.manual_seed(2025) 2025-05-07T20:32:52.2706051Z 2025-05-07T20:32:52.2706447Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2706950Z 2025-05-07T20:32:52.2707222Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2707646Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2708092Z x = x_sign * x_clamp 2025-05-07T20:32:52.2708436Z x0 = x[:, :D] 2025-05-07T20:32:52.2708768Z x1 = x[:, D:] 2025-05-07T20:32:52.2709077Z 2025-05-07T20:32:52.2709348Z if contiguous: 2025-05-07T20:32:52.2709692Z x0 = x0.contiguous() 2025-05-07T20:32:52.2710068Z x1 = x1.contiguous() 2025-05-07T20:32:52.2710424Z 2025-05-07T20:32:52.2710693Z if scale_ub is not None: 2025-05-07T20:32:52.2711103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2711586Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2712028Z ) 2025-05-07T20:32:52.2712312Z else: 2025-05-07T20:32:52.2712619Z scale_ub_tensor = None 2025-05-07T20:32:52.2712990Z 2025-05-07T20:32:52.2713318Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2713797Z op = silu_mul_quant 2025-05-07T20:32:52.2714163Z if compiled: 2025-05-07T20:32:52.2714532Z op = torch.compile(op) 2025-05-07T20:32:52.2714958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2715361Z 2025-05-07T20:32:52.2715649Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.2716062Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.2716502Z 2025-05-07T20:32:52.2716839Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2717336Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.2717778Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.2718229Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.2718765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2719250Z 2025-05-07T20:32:52.2719541Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.2719840Z 2025-05-07T20:32:52.2719986Z moe/activation_test.py:126: 2025-05-07T20:32:52.2720445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2720947Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.2721415Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2722580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.2723688Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.2724533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2725850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2726888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.2727975Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.2729017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.2729952Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.2731063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.2731841Z fn() 2025-05-07T20:32:52.2732583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.2733589Z self.fn.run( 2025-05-07T20:32:52.2734270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2735186Z kernel = self.compile( 2025-05-07T20:32:52.2735972Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2736926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2737496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2737820Z 2025-05-07T20:32:52.2738103Z self = 2025-05-07T20:32:52.2739650Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2741687Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295394540>} 2025-05-07T20:32:52.2743675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2745178Z context = 2025-05-07T20:32:52.2745596Z 2025-05-07T20:32:52.2745826Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2746546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2747185Z module_map=module_map) 2025-05-07T20:32:52.2747659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2748137Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.2748517Z E ^ 2025-05-07T20:32:52.2749137Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2749767Z 2025-05-07T20:32:52.2750342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2751061Z 2025-05-07T20:32:52.2751205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2751777Z self=, 2025-05-07T20:32:52.2752325Z T=2048, 2025-05-07T20:32:52.2752586Z D=5120, 2025-05-07T20:32:52.2752849Z scale_ub=1200.0, 2025-05-07T20:32:52.2753157Z contiguous=True, 2025-05-07T20:32:52.2753481Z compiled=False, 2025-05-07T20:32:52.2753785Z ) 2025-05-07T20:32:52.2754266Z self = 2025-05-07T20:32:52.2754948Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:52.2755336Z 2025-05-07T20:32:52.2755440Z @given( 2025-05-07T20:32:52.2755734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2756143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2756593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2757060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2757475Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2757867Z ) 2025-05-07T20:32:52.2758335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2758910Z def test_silu_mul_quant( 2025-05-07T20:32:52.2759231Z self, 2025-05-07T20:32:52.2759582Z T: int, 2025-05-07T20:32:52.2759825Z D: int, 2025-05-07T20:32:52.2760122Z scale_ub: Optional[float], 2025-05-07T20:32:52.2760493Z contiguous: bool, 2025-05-07T20:32:52.2760814Z compiled: bool, 2025-05-07T20:32:52.2761204Z ) -> None: 2025-05-07T20:32:52.2761494Z torch.manual_seed(2025) 2025-05-07T20:32:52.2761813Z 2025-05-07T20:32:52.2762141Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2762605Z 2025-05-07T20:32:52.2762871Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2763257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2763690Z x = x_sign * x_clamp 2025-05-07T20:32:52.2764027Z x0 = x[:, :D] 
2025-05-07T20:32:52.2764325Z x1 = x[:, D:] 2025-05-07T20:32:52.2764614Z 2025-05-07T20:32:52.2764867Z if contiguous: 2025-05-07T20:32:52.2765177Z x0 = x0.contiguous() 2025-05-07T20:32:52.2765547Z x1 = x1.contiguous() 2025-05-07T20:32:52.2765874Z 2025-05-07T20:32:52.2766124Z if scale_ub is not None: 2025-05-07T20:32:52.2766492Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2766932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2767348Z ) 2025-05-07T20:32:52.2767596Z else: 2025-05-07T20:32:52.2767870Z scale_ub_tensor = None 2025-05-07T20:32:52.2768203Z 2025-05-07T20:32:52.2768496Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2768918Z op = silu_mul_quant 2025-05-07T20:32:52.2769246Z if compiled: 2025-05-07T20:32:52.2769558Z op = torch.compile(op) 2025-05-07T20:32:52.2769955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2770330Z 2025-05-07T20:32:52.2770576Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2770796Z 2025-05-07T20:32:52.2770930Z moe/activation_test.py:117: 2025-05-07T20:32:52.2771328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2771771Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2772147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2773065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.2774005Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2774826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2775751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2776649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2777370Z kernel = self.compile( 2025-05-07T20:32:52.2778095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2778982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2779512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2779815Z 2025-05-07T20:32:52.2780094Z self = 2025-05-07T20:32:52.2781544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2783414Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295c8f240>} 2025-05-07T20:32:52.2785237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2786640Z context = 2025-05-07T20:32:52.2787034Z 2025-05-07T20:32:52.2787241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2787993Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2788591Z module_map=module_map) 2025-05-07T20:32:52.2789059Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2789519Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2789876Z E ^ 2025-05-07T20:32:52.2790475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2791080Z 2025-05-07T20:32:52.2791609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2792280Z 2025-05-07T20:32:52.2792416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2792938Z self=, 2025-05-07T20:32:52.2793452Z T=2048, 2025-05-07T20:32:52.2793703Z D=5120, 2025-05-07T20:32:52.2793975Z scale_ub=1200.0, 2025-05-07T20:32:52.2794269Z contiguous=True, 2025-05-07T20:32:52.2794557Z compiled=True, 2025-05-07T20:32:52.2794816Z ) 2025-05-07T20:32:52.2795231Z self = 2025-05-07T20:32:52.2795881Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.2796253Z 2025-05-07T20:32:52.2796357Z @given( 2025-05-07T20:32:52.2796653Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2797053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2797459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2797903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2798340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2798719Z ) 2025-05-07T20:32:52.2799183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2799784Z def test_silu_mul_quant( 2025-05-07T20:32:52.2800093Z self, 2025-05-07T20:32:52.2800352Z T: int, 2025-05-07T20:32:52.2800612Z D: int, 2025-05-07T20:32:52.2800892Z scale_ub: Optional[float], 2025-05-07T20:32:52.2801250Z contiguous: bool, 2025-05-07T20:32:52.2801568Z compiled: bool, 2025-05-07T20:32:52.2801855Z ) -> None: 2025-05-07T20:32:52.2802143Z torch.manual_seed(2025) 2025-05-07T20:32:52.2802462Z 2025-05-07T20:32:52.2802815Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2803273Z 2025-05-07T20:32:52.2803530Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2803902Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2804320Z x = x_sign * x_clamp 2025-05-07T20:32:52.2804642Z x0 = x[:, :D] 2025-05-07T20:32:52.2804950Z x1 = x[:, D:] 2025-05-07T20:32:52.2805226Z 2025-05-07T20:32:52.2805489Z if contiguous: 2025-05-07T20:32:52.2805814Z x0 = x0.contiguous() 2025-05-07T20:32:52.2806157Z x1 = x1.contiguous() 2025-05-07T20:32:52.2806498Z 2025-05-07T20:32:52.2806777Z if scale_ub is not None: 2025-05-07T20:32:52.2807137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2807600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2808041Z ) 2025-05-07T20:32:52.2808310Z else: 2025-05-07T20:32:52.2808630Z scale_ub_tensor = None 2025-05-07T20:32:52.2809018Z 2025-05-07T20:32:52.2809337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2809761Z op = silu_mul_quant 2025-05-07T20:32:52.2810100Z if compiled: 2025-05-07T20:32:52.2810524Z op = torch.compile(op) 2025-05-07T20:32:52.2810938Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2811313Z 2025-05-07T20:32:52.2811580Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.2811951Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.2812478Z 2025-05-07T20:32:52.2812805Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2813267Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.2813684Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.2814134Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.2814832Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2815264Z 2025-05-07T20:32:52.2815550Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.2815829Z 2025-05-07T20:32:52.2815971Z moe/activation_test.py:126: 2025-05-07T20:32:52.2816404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2816860Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.2817300Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2818399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.2819452Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.2820210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2821174Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2839756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.2840774Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.2841807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.2842697Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.2843541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.2844300Z fn() 2025-05-07T20:32:52.2845052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.2845904Z self.fn.run( 2025-05-07T20:32:52.2846566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2847322Z kernel = self.compile( 2025-05-07T20:32:52.2848058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2848952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2849510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2849822Z 2025-05-07T20:32:52.2850098Z self = 2025-05-07T20:32:52.2851621Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2853568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129679a020>} 2025-05-07T20:32:52.2855674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2857170Z context = 2025-05-07T20:32:52.2857896Z 2025-05-07T20:32:52.2858136Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2858893Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2859725Z module_map=module_map) 2025-05-07T20:32:52.2860246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2860747Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.2861118Z E ^ 2025-05-07T20:32:52.2861766Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2862415Z 2025-05-07T20:32:52.2863035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2863765Z 2025-05-07T20:32:52.2863914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2864503Z self=, 2025-05-07T20:32:52.2865087Z T=16384, 2025-05-07T20:32:52.2865366Z D=7168, 2025-05-07T20:32:52.2865651Z scale_ub=1200.0, 2025-05-07T20:32:52.2865973Z contiguous=False, 2025-05-07T20:32:52.2866293Z compiled=False, 2025-05-07T20:32:52.2866582Z ) 2025-05-07T20:32:52.2867030Z self = 2025-05-07T20:32:52.2867726Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.2868124Z 2025-05-07T20:32:52.2868229Z @given( 2025-05-07T20:32:52.2868543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2868976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2869388Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2869824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2870251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2870627Z ) 2025-05-07T20:32:52.2871129Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2871752Z def test_silu_mul_quant( 2025-05-07T20:32:52.2872083Z self, 2025-05-07T20:32:52.2872342Z T: int, 2025-05-07T20:32:52.2872626Z D: int, 2025-05-07T20:32:52.2872912Z scale_ub: Optional[float], 2025-05-07T20:32:52.2873278Z contiguous: bool, 2025-05-07T20:32:52.2873599Z compiled: bool, 2025-05-07T20:32:52.2873896Z ) -> None: 2025-05-07T20:32:52.2874187Z torch.manual_seed(2025) 2025-05-07T20:32:52.2874522Z 2025-05-07T20:32:52.2874886Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2875348Z 2025-05-07T20:32:52.2875617Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2876011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2876430Z x = x_sign * x_clamp 2025-05-07T20:32:52.2876761Z x0 = x[:, :D] 2025-05-07T20:32:52.2877072Z x1 = x[:, D:] 2025-05-07T20:32:52.2877367Z 2025-05-07T20:32:52.2877627Z if contiguous: 2025-05-07T20:32:52.2877956Z x0 = x0.contiguous() 2025-05-07T20:32:52.2878318Z x1 = x1.contiguous() 2025-05-07T20:32:52.2878665Z 2025-05-07T20:32:52.2878947Z if scale_ub is not None: 2025-05-07T20:32:52.2879330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2879786Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2880204Z ) 2025-05-07T20:32:52.2880440Z else: 2025-05-07T20:32:52.2880665Z scale_ub_tensor = None 2025-05-07T20:32:52.2880929Z 2025-05-07T20:32:52.2881178Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2881508Z op = silu_mul_quant 2025-05-07T20:32:52.2881776Z if compiled: 2025-05-07T20:32:52.2882039Z op = torch.compile(op) 2025-05-07T20:32:52.2882344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2882747Z 2025-05-07T20:32:52.2882955Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2883123Z 2025-05-07T20:32:52.2883232Z moe/activation_test.py:117: 2025-05-07T20:32:52.2883550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2883987Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2884310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2885065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
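The remaining failures are all the same Triton CompilationError: both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant request the fp8e4nv dtype, and the backend reports that only fp8e4b15 and fp8e5 exist on this architecture. That reads as a hardware-capability gap rather than a code bug; to our knowledge Triton only compiles fp8e4nv for compute capability 8.9 and newer, so a guard along these lines would skip rather than fail on this runner. The (8, 9) threshold and the class name below are our assumptions, not stated in the log:

    # Sketch: skip fp8e4nv kernels on GPUs that cannot compile them.
    # The (8, 9) capability threshold is an assumption, not from the log.
    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationFp8Tests(unittest.TestCase):
        ...
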
2025-05-07T20:32:52.2885804Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2886379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2887100Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2887816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2888382Z kernel = self.compile( 2025-05-07T20:32:52.2888953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2889642Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2890070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2890309Z 2025-05-07T20:32:52.2890530Z self = 2025-05-07T20:32:52.2891662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2893106Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295e4a8e0>} 2025-05-07T20:32:52.2894657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2895759Z context = 2025-05-07T20:32:52.2896061Z 2025-05-07T20:32:52.2896242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2896790Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2897290Z module_map=module_map) 2025-05-07T20:32:52.2897682Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2898069Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2898345Z E ^ 2025-05-07T20:32:52.2898844Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2899329Z 2025-05-07T20:32:52.2899782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2900332Z 2025-05-07T20:32:52.2900444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2900895Z self=, 2025-05-07T20:32:52.2901335Z T=1, 2025-05-07T20:32:52.2901539Z D=7168, 2025-05-07T20:32:52.2901741Z scale_ub=None, 2025-05-07T20:32:52.2901978Z contiguous=True, 2025-05-07T20:32:52.2902226Z compiled=True, 2025-05-07T20:32:52.2902442Z ) 2025-05-07T20:32:52.2902793Z self = 2025-05-07T20:32:52.2903317Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.2903591Z 2025-05-07T20:32:52.2903674Z @given( 2025-05-07T20:32:52.2903920Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2904349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2904673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2905023Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2905379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2905765Z ) 2025-05-07T20:32:52.2906134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2906611Z def test_silu_mul_quant( 2025-05-07T20:32:52.2906880Z self, 2025-05-07T20:32:52.2907089Z T: int, 2025-05-07T20:32:52.2907298Z D: int, 2025-05-07T20:32:52.2907541Z scale_ub: Optional[float], 2025-05-07T20:32:52.2907833Z contiguous: bool, 2025-05-07T20:32:52.2908093Z compiled: bool, 2025-05-07T20:32:52.2908329Z ) -> None: 2025-05-07T20:32:52.2908550Z torch.manual_seed(2025) 2025-05-07T20:32:52.2908803Z 2025-05-07T20:32:52.2909103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2909461Z 2025-05-07T20:32:52.2909668Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2909974Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2910296Z x = x_sign * x_clamp 2025-05-07T20:32:52.2910563Z x0 = x[:, :D] 2025-05-07T20:32:52.2910803Z x1 = x[:, D:] 2025-05-07T20:32:52.2911028Z 2025-05-07T20:32:52.2911221Z if contiguous: 2025-05-07T20:32:52.2911473Z x0 = x0.contiguous() 2025-05-07T20:32:52.2911749Z x1 = x1.contiguous() 2025-05-07T20:32:52.2911993Z 2025-05-07T20:32:52.2912199Z if scale_ub is not None: 2025-05-07T20:32:52.2912488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2912835Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2913166Z ) 2025-05-07T20:32:52.2913369Z else: 2025-05-07T20:32:52.2913580Z scale_ub_tensor = None 2025-05-07T20:32:52.2913869Z 2025-05-07T20:32:52.2914142Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2914466Z op = silu_mul_quant 2025-05-07T20:32:52.2914730Z if compiled: 2025-05-07T20:32:52.2914995Z op = torch.compile(op) 2025-05-07T20:32:52.2915306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2915599Z 2025-05-07T20:32:52.2915813Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.2916104Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.2916413Z 2025-05-07T20:32:52.2916661Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2917014Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.2917315Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.2917644Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.2918021Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2918341Z 2025-05-07T20:32:52.2918561Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:52.2918766Z 2025-05-07T20:32:52.2918880Z moe/activation_test.py:126: 2025-05-07T20:32:52.2919184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2919536Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.2919887Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2920727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.2921524Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.2922108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2922838Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2923574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.2924424Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.2925210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.2926434Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.2927077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.2927643Z fn() 2025-05-07T20:32:52.2928188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.2928815Z self.fn.run( 2025-05-07T20:32:52.2929310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2929879Z kernel = self.compile( 2025-05-07T20:32:52.2930459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2931155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2931569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2931823Z 2025-05-07T20:32:52.2932039Z self = 2025-05-07T20:32:52.2933187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2934781Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129587c860>} 2025-05-07T20:32:52.2936202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2937307Z context = 2025-05-07T20:32:52.2937622Z 2025-05-07T20:32:52.2937799Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2938359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2938846Z module_map=module_map) 2025-05-07T20:32:52.2939237Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2939614Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.2939893Z E ^ 2025-05-07T20:32:52.2940386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2940869Z 2025-05-07T20:32:52.2941317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2941864Z 2025-05-07T20:32:52.2941980Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2942412Z self=, 2025-05-07T20:32:52.2942846Z T=4096, 2025-05-07T20:32:52.2943054Z D=5120, 2025-05-07T20:32:52.2943255Z scale_ub=None, 2025-05-07T20:32:52.2943493Z contiguous=False, 2025-05-07T20:32:52.2943732Z compiled=False, 2025-05-07T20:32:52.2943942Z ) 2025-05-07T20:32:52.2944284Z self = 2025-05-07T20:32:52.2944813Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.2945103Z 2025-05-07T20:32:52.2945193Z @given( 2025-05-07T20:32:52.2945435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2945768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2946097Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2946644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2946994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2947298Z ) 2025-05-07T20:32:52.2947658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2948902Z def test_silu_mul_quant( 2025-05-07T20:32:52.2949167Z self, 2025-05-07T20:32:52.2949380Z T: int, 2025-05-07T20:32:52.2949585Z D: int, 2025-05-07T20:32:52.2949819Z scale_ub: Optional[float], 2025-05-07T20:32:52.2950108Z contiguous: bool, 2025-05-07T20:32:52.2950357Z compiled: bool, 2025-05-07T20:32:52.2950596Z ) -> None: 2025-05-07T20:32:52.2950820Z torch.manual_seed(2025) 2025-05-07T20:32:52.2951069Z 2025-05-07T20:32:52.2951354Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2951717Z 2025-05-07T20:32:52.2951912Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2952219Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2952539Z x = x_sign * x_clamp 2025-05-07T20:32:52.2952779Z x0 = x[:, :D] 2025-05-07T20:32:52.2953002Z x1 = x[:, D:] 2025-05-07T20:32:52.2953223Z 2025-05-07T20:32:52.2953411Z if contiguous: 2025-05-07T20:32:52.2953662Z x0 = x0.contiguous() 2025-05-07T20:32:52.2953927Z x1 = x1.contiguous() 2025-05-07T20:32:52.2954177Z 2025-05-07T20:32:52.2954365Z if scale_ub is not None: 2025-05-07T20:32:52.2954646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2954989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2955296Z ) 2025-05-07T20:32:52.2955501Z else: 2025-05-07T20:32:52.2955722Z scale_ub_tensor = None 2025-05-07T20:32:52.2955982Z 2025-05-07T20:32:52.2956222Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2956543Z op = silu_mul_quant 2025-05-07T20:32:52.2956800Z if compiled: 2025-05-07T20:32:52.2957054Z op = torch.compile(op) 2025-05-07T20:32:52.2957359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2957634Z 2025-05-07T20:32:52.2959283Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2959456Z 2025-05-07T20:32:52.2959562Z moe/activation_test.py:117: 2025-05-07T20:32:52.2959863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2960203Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2960495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2961214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.2961942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2962507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2963238Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2963949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2964514Z kernel = self.compile( 2025-05-07T20:32:52.2965097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2965798Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2966211Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2966461Z 2025-05-07T20:32:52.2966673Z self = 2025-05-07T20:32:52.2967812Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2969342Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a8180>} 2025-05-07T20:32:52.2970926Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2972128Z context = 2025-05-07T20:32:52.2972448Z 2025-05-07T20:32:52.2972626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2973191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2973714Z module_map=module_map) 2025-05-07T20:32:52.2974095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2974582Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2974875Z E ^ 2025-05-07T20:32:52.2975365Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2975849Z 2025-05-07T20:32:52.2976296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2976862Z 2025-05-07T20:32:52.2976970Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2977408Z self=, 2025-05-07T20:32:52.2977831Z T=4096, 2025-05-07T20:32:52.2978039Z D=7168, 2025-05-07T20:32:52.2978251Z scale_ub=None, 2025-05-07T20:32:52.2978475Z contiguous=False, 2025-05-07T20:32:52.2978721Z compiled=False, 2025-05-07T20:32:52.2978944Z ) 2025-05-07T20:32:52.2979271Z self = 2025-05-07T20:32:52.2979797Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.2980084Z 2025-05-07T20:32:52.2980175Z @given( 2025-05-07T20:32:52.2980467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2980894Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2981220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2981565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2981896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2982195Z ) 2025-05-07T20:32:52.2982549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2982998Z def test_silu_mul_quant( 2025-05-07T20:32:52.2983246Z self, 2025-05-07T20:32:52.2983448Z T: int, 2025-05-07T20:32:52.2983641Z D: int, 2025-05-07T20:32:52.2983868Z scale_ub: Optional[float], 2025-05-07T20:32:52.2984154Z contiguous: bool, 2025-05-07T20:32:52.2984429Z compiled: bool, 2025-05-07T20:32:52.2984679Z ) -> None: 2025-05-07T20:32:52.2984899Z torch.manual_seed(2025) 2025-05-07T20:32:52.2985141Z 2025-05-07T20:32:52.2985418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2985778Z 2025-05-07T20:32:52.2985989Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2986278Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2986602Z x = x_sign * x_clamp 2025-05-07T20:32:52.2986850Z x0 = x[:, :D] 2025-05-07T20:32:52.2987067Z x1 = x[:, D:] 2025-05-07T20:32:52.2987283Z 2025-05-07T20:32:52.2987475Z if contiguous: 2025-05-07T20:32:52.2987707Z x0 = x0.contiguous() 2025-05-07T20:32:52.2987977Z x1 = x1.contiguous() 2025-05-07T20:32:52.2988235Z 2025-05-07T20:32:52.2988427Z if scale_ub is not None: 2025-05-07T20:32:52.2988708Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2989053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2989465Z ) 2025-05-07T20:32:52.2989675Z else: 2025-05-07T20:32:52.2989904Z scale_ub_tensor = None 2025-05-07T20:32:52.2990163Z 2025-05-07T20:32:52.2990411Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2990836Z op = silu_mul_quant 2025-05-07T20:32:52.2991112Z if compiled: 2025-05-07T20:32:52.2991375Z op = torch.compile(op) 2025-05-07T20:32:52.2991695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2992001Z 2025-05-07T20:32:52.2992208Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2992394Z 2025-05-07T20:32:52.2992498Z moe/activation_test.py:117: 2025-05-07T20:32:52.2992821Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2993167Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2993478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2994276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.2995023Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2995593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2996336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2997054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2997619Z kernel = self.compile( 2025-05-07T20:32:52.2998201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2998913Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2999332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2999572Z 2025-05-07T20:32:52.2999792Z self = 2025-05-07T20:32:52.3000924Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3002360Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a9080>} 2025-05-07T20:32:52.3003775Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3004881Z context = 2025-05-07T20:32:52.3005185Z 2025-05-07T20:32:52.3005360Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3005914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3006410Z module_map=module_map) 2025-05-07T20:32:52.3006578Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3006687Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3006777Z E ^ 2025-05-07T20:32:52.3007153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3007158Z 2025-05-07T20:32:52.3007607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3007611Z 2025-05-07T20:32:52.3007719Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3007952Z self=, 2025-05-07T20:32:52.3008046Z T=128, 2025-05-07T20:32:52.3008130Z D=7168, 2025-05-07T20:32:52.3008336Z scale_ub=None, 2025-05-07T20:32:52.3008440Z contiguous=False, 2025-05-07T20:32:52.3008528Z compiled=True, 2025-05-07T20:32:52.3008619Z ) 2025-05-07T20:32:52.3008849Z self = 2025-05-07T20:32:52.3009109Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.3009113Z 2025-05-07T20:32:52.3009201Z @given( 2025-05-07T20:32:52.3009325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3009431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3009558Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3009682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3009799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3009888Z ) 2025-05-07T20:32:52.3010143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3010252Z def test_silu_mul_quant( 2025-05-07T20:32:52.3010340Z self, 2025-05-07T20:32:52.3010422Z T: int, 2025-05-07T20:32:52.3010518Z D: int, 2025-05-07T20:32:52.3010627Z scale_ub: Optional[float], 2025-05-07T20:32:52.3010725Z contiguous: bool, 2025-05-07T20:32:52.3010839Z compiled: bool, 2025-05-07T20:32:52.3010926Z ) -> None: 2025-05-07T20:32:52.3011030Z torch.manual_seed(2025) 2025-05-07T20:32:52.3011117Z 2025-05-07T20:32:52.3011296Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3011386Z 2025-05-07T20:32:52.3011485Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3011614Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3011714Z x = x_sign * x_clamp 2025-05-07T20:32:52.3029379Z x0 = x[:, :D] 2025-05-07T20:32:52.3029499Z x1 = x[:, D:] 2025-05-07T20:32:52.3029583Z 2025-05-07T20:32:52.3029676Z if contiguous: 2025-05-07T20:32:52.3029783Z x0 = x0.contiguous() 2025-05-07T20:32:52.3029887Z x1 = x1.contiguous() 2025-05-07T20:32:52.3029965Z 2025-05-07T20:32:52.3030059Z if scale_ub is not None: 2025-05-07T20:32:52.3030180Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3030327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3030406Z ) 2025-05-07T20:32:52.3030501Z else: 2025-05-07T20:32:52.3030600Z scale_ub_tensor = None 2025-05-07T20:32:52.3030675Z 2025-05-07T20:32:52.3030823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3030917Z op = silu_mul_quant 2025-05-07T20:32:52.3031011Z if compiled: 2025-05-07T20:32:52.3031116Z op = torch.compile(op) 2025-05-07T20:32:52.3031225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3031309Z 2025-05-07T20:32:52.3031404Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.3031534Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.3031619Z 2025-05-07T20:32:52.3031760Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3031865Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.3031975Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.3032105Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.3032259Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.3032334Z 2025-05-07T20:32:52.3032437Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:52.3032443Z 2025-05-07T20:32:52.3032555Z moe/activation_test.py:126: 2025-05-07T20:32:52.3032690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3032801Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.3032945Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.3033721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.3033837Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.3034261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3034614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3035009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.3035274Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.3035670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.3035855Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.3036220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.3036311Z fn() 2025-05-07T20:32:52.3036732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.3036817Z self.fn.run( 2025-05-07T20:32:52.3037185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3037282Z kernel = self.compile( 2025-05-07T20:32:52.3037682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3037876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3038008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3038014Z 2025-05-07T20:32:52.3038235Z self = 2025-05-07T20:32:52.3039050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3039576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a9f80>} 2025-05-07T20:32:52.3040374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3040571Z context = 2025-05-07T20:32:52.3040575Z 2025-05-07T20:32:52.3040755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3041030Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3041155Z module_map=module_map) 2025-05-07T20:32:52.3041326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3041430Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.3041521Z E ^ 2025-05-07T20:32:52.3041893Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    (test body identical to the first example above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
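Note that the failure is independent of FBGEMM's kernel logic: any Triton kernel that casts to tl.float8e4nv trips the same ValueError at compile time on this architecture, whether launched eagerly or under torch.compile. A minimal repro sketch (hypothetical, not taken from the log):

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 this cast is what raises
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)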
(The next six drawn examples fail identically; the repeated test source and Triton compile stacks are omitted.)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
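For reference, the operation under test is a SiLU gate followed by row-wise FP8 quantization. A rough eager-mode equivalent is sketched below; the scaling convention (abs-max per row against the E4M3 maximum of 448) is an assumption, and triton_quantize_fp8_row's exact behavior may differ:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


    def silu_mul_quant_eager(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = silu(x0) * x1, computed in fp32 exactly like the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One scale per row, optionally clamped by scale_ub.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale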
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test body identical to the first example above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (same Triton compile stack as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
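The error message itself names the only formats this GPU can use. A hedged sketch of capability-based dtype selection follows; this illustrates a possible fallback, not what fbgemm_gpu does (a kernel that switches formats must also rescale, since E5M2 trades mantissa bits for range):

    import torch


    def pick_fp8_dtype() -> torch.dtype:
        # Per the error above, sm_86 supports fp8e5 (E5M2) and fp8e4b15,
        # but not fp8e4nv (E4M3); E4M3 needs sm_89 or newer.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2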
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    (test body identical to the first example above)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    (same Triton compile stack as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:52.3209312Z 
2025-05-07T20:32:52.3209421Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:52.3209653Z     self=,
2025-05-07T20:32:52.3209741Z     T=1,
2025-05-07T20:32:52.3209822Z     D=5120,
2025-05-07T20:32:52.3209912Z     scale_ub=None,
2025-05-07T20:32:52.3210004Z     contiguous=True,
2025-05-07T20:32:52.3210090Z     compiled=False,
2025-05-07T20:32:52.3210179Z )
2025-05-07T20:32:52.3210410Z self = 
2025-05-07T20:32:52.3210587Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
[test_silu_mul_quant source identical to the listing above; this example fails earlier, at the first kernel launch inside fn():]
2025-05-07T20:32:52.3215698Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:52.3215703Z 
2025-05-07T20:32:52.3215803Z moe/activation_test.py:117: 
2025-05-07T20:32:52.3215933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:52.3216035Z moe/activation_test.py:115: in fn
2025-05-07T20:32:52.3216140Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:52.3216668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:52.3216772Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:52.3217146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:52.3217374Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:52.3217734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:52.3217830Z     kernel = self.compile(
2025-05-07T20:32:52.3218228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:52.3218411Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:52.3218545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:52.3218550Z 
2025-05-07T20:32:52.3218760Z self = 
2025-05-07T20:32:52.3219567Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:52.3220081Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a001bc0>}
2025-05-07T20:32:52.3220878Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:52.3221069Z context = 
2025-05-07T20:32:52.3221079Z 
2025-05-07T20:32:52.3221254Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:52.3221523Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:52.3221632Z                            module_map=module_map)
2025-05-07T20:32:52.3221795Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:52.3221896Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:52.3221978Z E       ^
2025-05-07T20:32:52.3222346Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:52.3222351Z 
2025-05-07T20:32:52.3222881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
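For reference, what the failing launch is trying to compute: silu_mul_quant fuses y = silu(x0) * x1 with row-wise FP8 quantization, returning the FP8 payload plus one float32 scale per row, so that y_fp8.to(torch.float32) * y_scale[:, None] recovers y (as the test's dequantization line shows). A minimal unfused PyTorch sketch of the same math (an illustration, not FBGEMM's kernel), assuming torch.float8_e4m3fn is available (PyTorch >= 2.1) and that scale_ub caps the per-row max:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # silu(x0) * x1
        row_max = y.abs().amax(dim=1)                            # per-row amax
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)           # cap the amax
        scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)        # dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Called with the x0, x1, and scale_ub_tensor built by the test above, this returns the (y_fp8, y_scale) pair the test dequantizes and compares.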
2025-05-07T20:32:52.3222999Z Trying example: test_silu_mul_quant(
    self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True,
)
[source listing identical to the first example above. Because compiled=True, the traceback carries one extra frame from torch.compile before reaching the same kernel launch:]
moe/activation_test.py:117: 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[remaining frames identical to the previous example:]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
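The "at 1:0" in the CompilationError points at the kernel's def line because the error is raised while the Python AST is lowered to Triton IR (src.make_ir -> ast_to_ttir), before any GPU code exists. The failure does not need FBGEMM at all; a minimal sketch (a hypothetical repro, using the same Triton API the traceback shows) that raises the identical ValueError on a pre-SM-8.9 GPU:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM 8.6 this cast is what triggers:
        #   ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    n = 1024
    x = torch.randn(n, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(n, 1024),)](x, y, n, BLOCK=1024)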
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[each of these examples fails at the fn() call with the same CompilationError in _fbgemm_silu_mul_quant shown above; the compiled=True runs add the torch/_dynamo/eval_frame.py:678 frame, nothing else differs]
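The parameter tuples above come from Hypothesis walking the @given grid (Verbosity.verbose prints each "Trying example"). If one of these grid points needed to be replayed deterministically on every run, say while bisecting this failure, Hypothesis's @example decorator could be stacked onto the test. A sketch (max_examples=_MAX_SAMPLES omitted here, since _MAX_SAMPLES is local to the test module):

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @example(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...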
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[for this example the fn() call succeeds and the failure moves to the reference path, exactly as in the first full listing above: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, failing inside the autotuner while benchmarking candidate configs:]
moe/activation_test.py:126: 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
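Note where the two failure modes diverge: the fn() path fails directly in jit.py's run -> compile, while the ref_fn() path goes through autotuner.py first (run -> _bench -> do_bench), because _kernel_quantize_fp8_row is autotuned and each candidate config is compiled lazily the first time it is benchmarked. Either way the architecture check fires at first launch, not at import. One way such a suite could surface this as a skip rather than a Hypothesis-reported error; a sketch, where call_or_skip is a hypothetical helper built on the CompilationError type the traceback itself names:

    import pytest
    from triton.compiler.errors import CompilationError


    def call_or_skip(fn, *args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except CompilationError as e:
            pytest.skip(f"Triton fp8 kernel unsupported on this GPU: {e}")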
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3323642Z 2025-05-07T20:32:52.3324080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3324084Z 2025-05-07T20:32:52.3324185Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3324413Z self=, 2025-05-07T20:32:52.3324488Z T=1, 2025-05-07T20:32:52.3324559Z D=5120, 2025-05-07T20:32:52.3324645Z scale_ub=1200.0, 2025-05-07T20:32:52.3324727Z contiguous=False, 2025-05-07T20:32:52.3324808Z compiled=True, 2025-05-07T20:32:52.3324889Z ) 2025-05-07T20:32:52.3325112Z self = 2025-05-07T20:32:52.3325280Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.3325289Z 2025-05-07T20:32:52.3325364Z @given( 2025-05-07T20:32:52.3325738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3325841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3325952Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3326065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3326175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3326244Z ) 2025-05-07T20:32:52.3326493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3326587Z def test_silu_mul_quant( 2025-05-07T20:32:52.3326662Z self, 2025-05-07T20:32:52.3326741Z T: int, 2025-05-07T20:32:52.3326966Z D: int, 2025-05-07T20:32:52.3327064Z scale_ub: Optional[float], 2025-05-07T20:32:52.3327152Z contiguous: bool, 2025-05-07T20:32:52.3327240Z compiled: bool, 2025-05-07T20:32:52.3327316Z ) -> None: 2025-05-07T20:32:52.3327542Z torch.manual_seed(2025) 2025-05-07T20:32:52.3327612Z 2025-05-07T20:32:52.3327781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3327855Z 2025-05-07T20:32:52.3327947Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3328069Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3328159Z x = x_sign * x_clamp 2025-05-07T20:32:52.3328236Z x0 = x[:, :D] 2025-05-07T20:32:52.3328316Z x1 = x[:, D:] 2025-05-07T20:32:52.3328385Z 2025-05-07T20:32:52.3328461Z if contiguous: 2025-05-07T20:32:52.3328548Z x0 = x0.contiguous() 2025-05-07T20:32:52.3328639Z x1 = x1.contiguous() 2025-05-07T20:32:52.3328711Z 2025-05-07T20:32:52.3328803Z if scale_ub is not None: 2025-05-07T20:32:52.3328907Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3329036Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3329117Z ) 2025-05-07T20:32:52.3329199Z else: 2025-05-07T20:32:52.3329288Z scale_ub_tensor = None 2025-05-07T20:32:52.3329361Z 2025-05-07T20:32:52.3329488Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3329573Z op = silu_mul_quant 2025-05-07T20:32:52.3329659Z if compiled: 2025-05-07T20:32:52.3329754Z op = torch.compile(op) 2025-05-07T20:32:52.3329855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3329926Z 2025-05-07T20:32:52.3330010Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3330015Z 2025-05-07T20:32:52.3330106Z moe/activation_test.py:117: 2025-05-07T20:32:52.3330243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3330339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3330440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3330821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.3330915Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.3331443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3331537Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3331908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3332139Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3332494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3332588Z kernel = self.compile( 2025-05-07T20:32:52.3332993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3333169Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3333307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3333311Z 2025-05-07T20:32:52.3333514Z self = 2025-05-07T20:32:52.3334352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3335008Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279459f80>} 2025-05-07T20:32:52.3335890Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3336082Z context = 2025-05-07T20:32:52.3336163Z 2025-05-07T20:32:52.3336328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3336606Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3336711Z module_map=module_map) 2025-05-07T20:32:52.3336871Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3336972Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3337048Z E ^ 2025-05-07T20:32:52.3337423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
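This failure is an architecture mismatch rather than a problem with the test inputs: fp8e4nv is Triton's name for the float8_e4m3fn format, which (as of this Triton version) NVIDIA GPUs expose only from compute capability 8.9 (Ada/Hopper) upward; older parts offer just the fp8e4b15 and fp8e5 variants named in the ValueError. Below is a minimal sketch of a capability guard that would skip these examples on unsupported hardware; the sm_89 cutoff is an assumption about Triton's support matrix, and the class name is hypothetical (the real suite in moe/activation_test.py carries no such guard):

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8_e4m3fn) needs an NVIDIA GPU
        # with compute capability >= 8.9; anything older only exposes the
        # fp8e4b15/fp8e5 variants reported by the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantGuardedTest(unittest.TestCase):  # hypothetical name
        def test_placeholder(self) -> None:
            self.assertTrue(supports_fp8e4nv())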
Hypothesis went on to draw eleven more examples, and every one failed at the same kernel-compilation step with the identical CompilationError (ValueError: type fp8e4nv not supported in this architecture):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

For the compiled=True examples the traceback additionally passes through torch/_dynamo/eval_frame.py:678 (in _fn: return fn(*args, **kwargs)) before reaching activation.py:80; the failing frame and error are otherwise identical to the one shown above.
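For context on what the failing call would compute if the kernel compiled: the test hands silu_mul_quant two bfloat16 halves x0 and x1 of shape [T, D] plus an optional float32 scale_ub tensor, and unpacks a (y_fp8, y_scale) pair. The kernel body is not visible in this log, so the eager-mode sketch below is an assumption about that contract (a SiLU-gated multiply followed by row-wise FP8 quantization, with scale_ub capping the amax), not fbgemm's actual algorithm:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def silu_mul_quant_reference(
        x0: torch.Tensor,  # [T, D], bfloat16
        x1: torch.Tensor,  # [T, D], bfloat16
        scale_ub: Optional[torch.Tensor] = None,  # [1], float32
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU-gated multiply, computed in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Row-wise amax, optionally capped by scale_ub (assumed semantics).
        row_max = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub)
        y_scale = (row_max / FP8_MAX).clamp(min=1e-12)  # avoid divide-by-zero
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

On the test's inputs this yields a [T, D] float8 tensor and a [T, 1] scale; whether the real kernel scales per row or per tensor cannot be determined from the log.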
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3459391Z 2025-05-07T20:32:52.3459834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3459839Z 2025-05-07T20:32:52.3459941Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3460174Z self=, 2025-05-07T20:32:52.3460254Z T=2048, 2025-05-07T20:32:52.3460330Z D=7168, 2025-05-07T20:32:52.3460412Z scale_ub=1200.0, 2025-05-07T20:32:52.3460500Z contiguous=False, 2025-05-07T20:32:52.3460584Z compiled=False, 2025-05-07T20:32:52.3460655Z ) 2025-05-07T20:32:52.3460880Z self = 2025-05-07T20:32:52.3461057Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.3461062Z 2025-05-07T20:32:52.3461144Z @given( 2025-05-07T20:32:52.3461266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3461452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3461574Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3461690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3461804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3461957Z ) 2025-05-07T20:32:52.3462211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3462311Z def test_silu_mul_quant( 2025-05-07T20:32:52.3462389Z self, 2025-05-07T20:32:52.3462468Z T: int, 2025-05-07T20:32:52.3462552Z D: int, 2025-05-07T20:32:52.3462653Z scale_ub: Optional[float], 2025-05-07T20:32:52.3462746Z contiguous: bool, 2025-05-07T20:32:52.3462837Z compiled: bool, 2025-05-07T20:32:52.3462915Z ) -> None: 2025-05-07T20:32:52.3463010Z torch.manual_seed(2025) 2025-05-07T20:32:52.3463088Z 2025-05-07T20:32:52.3463262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3463343Z 2025-05-07T20:32:52.3463440Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3463567Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3463660Z x = x_sign * x_clamp 2025-05-07T20:32:52.3463749Z x0 = x[:, :D] 2025-05-07T20:32:52.3463834Z x1 = x[:, D:] 2025-05-07T20:32:52.3463911Z 2025-05-07T20:32:52.3463996Z if contiguous: 2025-05-07T20:32:52.3464091Z x0 = x0.contiguous() 2025-05-07T20:32:52.3464188Z x1 = x1.contiguous() 2025-05-07T20:32:52.3464262Z 2025-05-07T20:32:52.3464355Z if scale_ub is not None: 2025-05-07T20:32:52.3464465Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3464600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3464680Z ) 2025-05-07T20:32:52.3464766Z else: 2025-05-07T20:32:52.3464859Z scale_ub_tensor = None 2025-05-07T20:32:52.3464935Z 2025-05-07T20:32:52.3465068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3465155Z op = silu_mul_quant 2025-05-07T20:32:52.3465239Z if compiled: 2025-05-07T20:32:52.3465333Z op = torch.compile(op) 2025-05-07T20:32:52.3465437Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3465518Z 2025-05-07T20:32:52.3465604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3465609Z 2025-05-07T20:32:52.3465701Z moe/activation_test.py:117: 2025-05-07T20:32:52.3465835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3465931Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3466028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3466558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.3466652Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3467036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3467264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3467616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3467714Z kernel = self.compile( 2025-05-07T20:32:52.3468111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3468283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3468412Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3468417Z 2025-05-07T20:32:52.3468622Z self = 2025-05-07T20:32:52.3469517Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3470031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129592a340>} 2025-05-07T20:32:52.3470899Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3471093Z context = 2025-05-07T20:32:52.3471098Z 2025-05-07T20:32:52.3471266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3471541Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3471648Z module_map=module_map) 2025-05-07T20:32:52.3471818Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3471917Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3471998Z E ^ 2025-05-07T20:32:52.3472365Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3472374Z 2025-05-07T20:32:52.3472805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3472809Z 2025-05-07T20:32:52.3472910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3473140Z self=, 2025-05-07T20:32:52.3473216Z T=1, 2025-05-07T20:32:52.3473303Z D=7168, 2025-05-07T20:32:52.3473385Z scale_ub=None, 2025-05-07T20:32:52.3473468Z contiguous=True, 2025-05-07T20:32:52.3473554Z compiled=False, 2025-05-07T20:32:52.3473628Z ) 2025-05-07T20:32:52.3473856Z self = 2025-05-07T20:32:52.3474027Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:52.3474031Z 2025-05-07T20:32:52.3474109Z @given( 2025-05-07T20:32:52.3474238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3474360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3474499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3474621Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3474735Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3474811Z ) 2025-05-07T20:32:52.3475064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3475159Z def test_silu_mul_quant( 2025-05-07T20:32:52.3475236Z self, 2025-05-07T20:32:52.3475316Z T: int, 2025-05-07T20:32:52.3475393Z D: int, 2025-05-07T20:32:52.3475489Z scale_ub: Optional[float], 2025-05-07T20:32:52.3475585Z contiguous: bool, 2025-05-07T20:32:52.3475671Z compiled: bool, 2025-05-07T20:32:52.3475746Z ) -> None: 2025-05-07T20:32:52.3475836Z torch.manual_seed(2025) 2025-05-07T20:32:52.3475905Z 2025-05-07T20:32:52.3476076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3476150Z 2025-05-07T20:32:52.3476237Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3476360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3476442Z x = x_sign * x_clamp 2025-05-07T20:32:52.3476519Z x0 = x[:, :D] 2025-05-07T20:32:52.3476595Z x1 = x[:, D:] 2025-05-07T20:32:52.3476662Z 2025-05-07T20:32:52.3476742Z if contiguous: 2025-05-07T20:32:52.3476835Z x0 = x0.contiguous() 2025-05-07T20:32:52.3476922Z x1 = x1.contiguous() 2025-05-07T20:32:52.3476994Z 2025-05-07T20:32:52.3477082Z if scale_ub is not None: 2025-05-07T20:32:52.3477266Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3477401Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3477470Z ) 2025-05-07T20:32:52.3477545Z else: 2025-05-07T20:32:52.3477637Z scale_ub_tensor = None 2025-05-07T20:32:52.3477785Z 2025-05-07T20:32:52.3477911Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3478002Z op = silu_mul_quant 2025-05-07T20:32:52.3478082Z if compiled: 2025-05-07T20:32:52.3478175Z op = torch.compile(op) 2025-05-07T20:32:52.3478281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3478352Z 2025-05-07T20:32:52.3478442Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3478449Z 2025-05-07T20:32:52.3478540Z moe/activation_test.py:117: 2025-05-07T20:32:52.3478674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3478773Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3478873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3479394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3479493Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3479869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3480093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3480453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3480543Z kernel = self.compile( 2025-05-07T20:32:52.3480944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3481118Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3481247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3481251Z 2025-05-07T20:32:52.3481455Z self = 2025-05-07T20:32:52.3482262Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3482784Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295a031a0>} 2025-05-07T20:32:52.3483575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3483768Z context = 2025-05-07T20:32:52.3483773Z 2025-05-07T20:32:52.3483940Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3484210Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3484319Z module_map=module_map) 2025-05-07T20:32:52.3484509Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3484623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3484700Z E ^ 2025-05-07T20:32:52.3485068Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129587da80>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
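The failure is environmental rather than numerical: fp8e4nv is Triton's name for the FP8 E4M3 format (torch.float8_e4m3fn), which Triton can only compile for NVIDIA GPUs of compute capability sm_89 or newer (Ada, Hopper). On older parts such as the A10G (sm_86) it advertises exactly the set quoted in the error, ('fp8e4b15', 'fp8e5'). A minimal capability guard, sketched under that assumption (supports_fp8e4nv is a hypothetical helper, not an fbgemm_gpu API):

# Hypothetical guard, assuming Triton's sm_89+ requirement for fp8e4nv.
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (FP8 E4M3) compiles only on sm_89+ (Ada, Hopper); older
    # devices raise the CompilationError seen above at kernel-build time.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

if supports_fp8e4nv():
    pass  # safe to launch the fp8e4nv quantization kernel
else:
    pass  # fall back (e.g. bf16) or skip on this runner

Gating the kernel launch, or the test, on such a check avoids invoking the fp8e4nv path on hardware that can never compile it.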
All remaining Hypothesis examples failed identically: each drawn parameter set reached _fbgemm_silu_mul_quant[grid]( in fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 and raised the same triton.compiler.errors.CompilationError (ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")). The examples tried:

Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3650060Z 2025-05-07T20:32:52.3650496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3650500Z 2025-05-07T20:32:52.3650602Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3650836Z self=, 2025-05-07T20:32:52.3650911Z T=16384, 2025-05-07T20:32:52.3650988Z D=5120, 2025-05-07T20:32:52.3651070Z scale_ub=1200.0, 2025-05-07T20:32:52.3651153Z contiguous=True, 2025-05-07T20:32:52.3651241Z compiled=False, 2025-05-07T20:32:52.3651316Z ) 2025-05-07T20:32:52.3651541Z self = 2025-05-07T20:32:52.3651721Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:52.3651726Z 2025-05-07T20:32:52.3651802Z @given( 2025-05-07T20:32:52.3651918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3652018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3652132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3652250Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3652362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3652435Z ) 2025-05-07T20:32:52.3652691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3652783Z def test_silu_mul_quant( 2025-05-07T20:32:52.3652858Z self, 2025-05-07T20:32:52.3652937Z T: int, 2025-05-07T20:32:52.3653016Z D: int, 2025-05-07T20:32:52.3653114Z scale_ub: Optional[float], 2025-05-07T20:32:52.3653207Z contiguous: bool, 2025-05-07T20:32:52.3653292Z compiled: bool, 2025-05-07T20:32:52.3653369Z ) -> None: 2025-05-07T20:32:52.3653469Z torch.manual_seed(2025) 2025-05-07T20:32:52.3653545Z 2025-05-07T20:32:52.3653720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3653802Z 2025-05-07T20:32:52.3653897Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3654028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3654124Z x = x_sign * x_clamp 2025-05-07T20:32:52.3654207Z x0 = x[:, :D] 2025-05-07T20:32:52.3654493Z x1 = x[:, D:] 2025-05-07T20:32:52.3654573Z 2025-05-07T20:32:52.3654661Z if contiguous: 2025-05-07T20:32:52.3654762Z x0 = x0.contiguous() 2025-05-07T20:32:52.3654854Z x1 = x1.contiguous() 2025-05-07T20:32:52.3654929Z 2025-05-07T20:32:52.3655106Z if scale_ub is not None: 2025-05-07T20:32:52.3655214Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3655352Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3655432Z ) 2025-05-07T20:32:52.3655510Z else: 2025-05-07T20:32:52.3655610Z scale_ub_tensor = None 2025-05-07T20:32:52.3655684Z 2025-05-07T20:32:52.3655813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3655906Z op = silu_mul_quant 2025-05-07T20:32:52.3655990Z if compiled: 2025-05-07T20:32:52.3656088Z op = torch.compile(op) 2025-05-07T20:32:52.3656194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3656272Z 2025-05-07T20:32:52.3656367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3656371Z 2025-05-07T20:32:52.3656471Z moe/activation_test.py:117: 2025-05-07T20:32:52.3656600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3656706Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3656804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3657329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.3657429Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3657803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3658032Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3658393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3658484Z kernel = self.compile( 2025-05-07T20:32:52.3658885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3659057Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3659191Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3659195Z 2025-05-07T20:32:52.3659402Z self = 2025-05-07T20:32:52.3660209Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3660723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ad1c720>} 2025-05-07T20:32:52.3661518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3661716Z context = 2025-05-07T20:32:52.3661723Z 2025-05-07T20:32:52.3661888Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3662156Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3662265Z module_map=module_map) 2025-05-07T20:32:52.3662425Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3662524Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3662603Z E ^ 2025-05-07T20:32:52.3662968Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3663056Z 2025-05-07T20:32:52.3663499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3663504Z 2025-05-07T20:32:52.3663609Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3663942Z self=, 2025-05-07T20:32:52.3664042Z T=1, 2025-05-07T20:32:52.3664132Z D=7168, 2025-05-07T20:32:52.3664212Z scale_ub=1200.0, 2025-05-07T20:32:52.3664301Z contiguous=False, 2025-05-07T20:32:52.3664384Z compiled=False, 2025-05-07T20:32:52.3664456Z ) 2025-05-07T20:32:52.3664677Z self = 2025-05-07T20:32:52.3664845Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.3664850Z 2025-05-07T20:32:52.3664928Z @given( 2025-05-07T20:32:52.3665048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3665154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3665275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3665394Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3665509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3665594Z ) 2025-05-07T20:32:52.3665844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3665942Z def test_silu_mul_quant( 2025-05-07T20:32:52.3666023Z self, 2025-05-07T20:32:52.3666101Z T: int, 2025-05-07T20:32:52.3666184Z D: int, 2025-05-07T20:32:52.3666284Z scale_ub: Optional[float], 2025-05-07T20:32:52.3666375Z contiguous: bool, 2025-05-07T20:32:52.3666467Z compiled: bool, 2025-05-07T20:32:52.3666544Z ) -> None: 2025-05-07T20:32:52.3666638Z torch.manual_seed(2025) 2025-05-07T20:32:52.3666715Z 2025-05-07T20:32:52.3666887Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3666963Z 2025-05-07T20:32:52.3667058Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3667179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3667268Z x = x_sign * x_clamp 2025-05-07T20:32:52.3667362Z x0 = x[:, :D] 2025-05-07T20:32:52.3667440Z x1 = x[:, D:] 2025-05-07T20:32:52.3667518Z 2025-05-07T20:32:52.3667601Z if contiguous: 2025-05-07T20:32:52.3667692Z x0 = x0.contiguous() 2025-05-07T20:32:52.3667786Z x1 = x1.contiguous() 2025-05-07T20:32:52.3667856Z 2025-05-07T20:32:52.3667945Z if scale_ub is not None: 2025-05-07T20:32:52.3668049Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3668182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3668258Z ) 2025-05-07T20:32:52.3668338Z else: 2025-05-07T20:32:52.3668430Z scale_ub_tensor = None 2025-05-07T20:32:52.3668502Z 2025-05-07T20:32:52.3668640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3668730Z op = silu_mul_quant 2025-05-07T20:32:52.3668812Z if compiled: 2025-05-07T20:32:52.3668914Z op = torch.compile(op) 2025-05-07T20:32:52.3669024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3669097Z 2025-05-07T20:32:52.3669187Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3669191Z 2025-05-07T20:32:52.3669286Z moe/activation_test.py:117: 2025-05-07T20:32:52.3669421Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3669521Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3669619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3670147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3670242Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3670714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3670948Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3671308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3671505Z kernel = self.compile( 2025-05-07T20:32:52.3671905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3672080Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3672215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3672220Z 2025-05-07T20:32:52.3672423Z self = 2025-05-07T20:32:52.3673242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3673755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b2f05e0>} 2025-05-07T20:32:52.3674558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3674755Z context = 2025-05-07T20:32:52.3674760Z 2025-05-07T20:32:52.3674927Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3675204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3675309Z module_map=module_map) 2025-05-07T20:32:52.3675479Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3675576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3675652Z E ^ 2025-05-07T20:32:52.3676020Z E ValueError("type fp8e4nv not supported in this architecture. 
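For context on what these examples exercise: silu_mul_quant fuses a SiLU gate with FP8 quantization. Below is a minimal eager-mode sketch of the math under test, assuming the op computes silu(x0) * x1 and then scales the result row-wise into torch.float8_e4m3fn (the dtype Triton calls fp8e4nv); the name silu_mul_quant_ref and the exact scaling scheme are illustrative assumptions, not FBGEMM's actual implementation.

import torch
from typing import Optional, Tuple

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Illustrative reference only: gate, multiply, then row-wise FP8 quantize.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the quantization range
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)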
[The following examples fail with the same CompilationError; the identical test body and traceback are elided.]

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       triton.compiler.errors.CompilationError: ... ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: ... (same error)

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
E       triton.compiler.errors.CompilationError: ... (same error)

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: ... (same error)

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: ... (same error)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
E       triton.compiler.errors.CompilationError: ... (same error)
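The root cause is the GPU generation, not the drawn parameters: fp8e4nv is Triton's name for FP8 E4M3 (torch.float8_e4m3fn), which Triton only compiles for NVIDIA parts with compute capability 8.9 or newer (Ada/Hopper). This runner is a g5.4xlarge, whose A10G reports capability 8.6, so only fp8e4b15 and fp8e5 are available, exactly as the error message says. A sketch of a capability guard a test like this could use to skip rather than fail on older GPUs (the helper name is illustrative, not an existing FBGEMM utility):

import unittest
import torch

def gpu_supports_fp8_e4m3() -> bool:
    # Triton's fp8e4nv needs SM 8.9+ (e.g. L4, RTX 4090, H100); the A10G is SM 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, e.g.:
# @unittest.skipUnless(gpu_supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
# def test_silu_mul_quant(...): ...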
Subsequent examples start to fail earlier, during input setup, with CUDA out-of-memory errors as GPU memory fills:

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 32.44 MiB free (21.61 GiB allocated by PyTorch, 136.52 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
E       torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 144.44 MiB free (21.50 GiB allocated by PyTorch, 136.52 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free (21.67 GiB allocated by PyTorch, 80.52 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB with 32.44 MiB free (21.67 GiB allocated by PyTorch, 80.52 MiB reserved but unallocated)
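Two mitigations are worth noting for these OOMs. The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which helps when a large share of memory is "reserved but unallocated"; here, though, 21.5+ GiB is live PyTorch allocations, which points at tensors surviving across Hypothesis examples. A sketch of both, under that assumption (illustrative, not the repo's actual fix):

# 1) Launch the test process with expandable segments, per the error message:
#      PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
# 2) Drop dead tensors and return cached blocks between examples:
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()              # collect Python garbage still holding CUDA tensors
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    torch.cuda.synchronize()  # ensure pending frees complete before the next example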
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3756601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3756834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3757186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3757278Z kernel = self.compile( 2025-05-07T20:32:52.3757679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3757852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3757981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3758067Z 2025-05-07T20:32:52.3758273Z self = 2025-05-07T20:32:52.3759080Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3759744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba10f40>} 2025-05-07T20:32:52.3760535Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3760730Z context = 2025-05-07T20:32:52.3760735Z 2025-05-07T20:32:52.3760906Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3761179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3761284Z module_map=module_map) 2025-05-07T20:32:52.3761449Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3761548Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3761625Z E ^ 2025-05-07T20:32:52.3761991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3761996Z 2025-05-07T20:32:52.3762430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3762434Z 2025-05-07T20:32:52.3762539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3762765Z self=, 2025-05-07T20:32:52.3762842Z T=128, 2025-05-07T20:32:52.3762921Z D=5120, 2025-05-07T20:32:52.3763004Z scale_ub=None, 2025-05-07T20:32:52.3763089Z contiguous=True, 2025-05-07T20:32:52.3763172Z compiled=False, 2025-05-07T20:32:52.3763245Z ) 2025-05-07T20:32:52.3763463Z self = 2025-05-07T20:32:52.3763637Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:52.3763645Z 2025-05-07T20:32:52.3763724Z @given( 2025-05-07T20:32:52.3763840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3763939Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3764051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3764165Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3764293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3764377Z ) 2025-05-07T20:32:52.3764652Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3764747Z def test_silu_mul_quant( 2025-05-07T20:32:52.3764823Z self, 2025-05-07T20:32:52.3764895Z T: int, 2025-05-07T20:32:52.3764972Z D: int, 2025-05-07T20:32:52.3765073Z scale_ub: Optional[float], 2025-05-07T20:32:52.3765170Z contiguous: bool, 2025-05-07T20:32:52.3765255Z compiled: bool, 2025-05-07T20:32:52.3765333Z ) -> None: 2025-05-07T20:32:52.3765430Z torch.manual_seed(2025) 2025-05-07T20:32:52.3765502Z 2025-05-07T20:32:52.3765671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3765748Z 2025-05-07T20:32:52.3765838Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3765960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3766053Z x = x_sign * x_clamp 2025-05-07T20:32:52.3766132Z x0 = x[:, :D] 2025-05-07T20:32:52.3766210Z x1 = x[:, D:] 2025-05-07T20:32:52.3766287Z 2025-05-07T20:32:52.3766451Z if contiguous: 2025-05-07T20:32:52.3766546Z x0 = x0.contiguous() 2025-05-07T20:32:52.3766635Z x1 = x1.contiguous() 2025-05-07T20:32:52.3766706Z 2025-05-07T20:32:52.3766798Z if scale_ub is not None: 2025-05-07T20:32:52.3766904Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3767111Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3767191Z ) 2025-05-07T20:32:52.3767266Z else: 2025-05-07T20:32:52.3767356Z scale_ub_tensor = None 2025-05-07T20:32:52.3767430Z 2025-05-07T20:32:52.3767556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3767644Z op = silu_mul_quant 2025-05-07T20:32:52.3767730Z if compiled: 2025-05-07T20:32:52.3767828Z op = torch.compile(op) 2025-05-07T20:32:52.3767933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3768003Z 2025-05-07T20:32:52.3768093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3768103Z 2025-05-07T20:32:52.3768204Z moe/activation_test.py:117: 2025-05-07T20:32:52.3768333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3768433Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3768541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3769061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3769156Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3769530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3769759Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3770115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3770208Z kernel = self.compile( 2025-05-07T20:32:52.3770609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3770788Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3770914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3770927Z 2025-05-07T20:32:52.3771135Z self = 2025-05-07T20:32:52.3771937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3772451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba12020>} 2025-05-07T20:32:52.3773246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3773438Z context = 2025-05-07T20:32:52.3773446Z 2025-05-07T20:32:52.3773618Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3773887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3773992Z module_map=module_map) 2025-05-07T20:32:52.3774155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3774252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3774333Z E ^ 2025-05-07T20:32:52.3774841Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError (fp8e4nv not supported, as above)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (30.44 MiB free)
moe/activation_test.py:94: OutOfMemoryError
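The CompilationError above is the expected Triton behavior on GPUs older than SM 8.9: the fp8e4nv encoding (torch.float8_e4m3fn) is only lowered on Ada/Hopper-class devices, and older parts (an A10G, for example, reports compute capability (8, 6)) only offer 'fp8e4b15' and 'fp8e5'. A minimal test-side guard, sketched under the assumption that pytest is the runner; the helper and marker names are illustrative, not from the test file:

    import pytest
    import torch

    def device_supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ (Ada/Hopper);
        # anything older raises the ValueError seen in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative marker; the test body could equally call pytest.skip()
    # itself, since Hypothesis wraps the whole test function:
    requires_fp8e4nv = pytest.mark.skipif(
        not device_supports_fp8e4nv(),
        reason="fp8e4nv requires compute capability >= 8.9",
    )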
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError
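Note how little memory is free in every OOM above (30.44 MiB of 22.07 GiB): allocations from earlier Hypothesis examples are still alive when the next sampled shape runs, so even a 40 MiB request fails. A sketch of one mitigation, assuming it is invoked between examples (the helper name is illustrative, not part of the test):

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first, then return the caching
        # allocator's unused blocks to CUDA so the next sampled (T, D)
        # shape starts from a clean allocator.
        gc.collect()
        torch.cuda.empty_cache()

    # e.g. call release_cuda_memory() as the first statement of
    # test_silu_mul_quant, or from a pytest fixture that wraps each test.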
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError (fp8e4nv not supported, as above)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError
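The OOM text itself names a mitigation: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The variable is read when the CUDA caching allocator initializes, so it has to be set before the first GPU allocation; a minimal sketch of the programmatic route (the environment-variable route, exporting it in the job before launching pytest, is equivalent):

    import os

    # Must be set before torch allocates on the GPU, ideally before importing
    # anything that touches CUDA. Shell equivalent:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest ./moe/activation_test.py
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the allocator picks it up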
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError (fp8e4nv not supported, as above; the compiled=True path fails identically under torch.compile)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (8.44 MiB free)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (8.44 MiB free)
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 (3 occurrences)
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "

experimental/gen_ai/test/moe/activation_test.py: 10 warnings
  /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
    torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 1 failed, 1 passed, 13 warnings in 20.57s ===================
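The FutureWarning above has a mechanical fix: torch.testing.assert_close accepts the same rtol/atol keywords, so the call at activation_test.py:72 can be swapped in place. A minimal sketch with illustrative tensors:

    import torch

    y = torch.tensor([1.000, 2.000])
    y_ref = torch.tensor([1.001, 2.001])

    # Deprecated since PyTorch 1.12:
    #   torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)
    # Replacement with identical tolerances:
    torch.testing.assert_close(y, y_ref, rtol=1.6e-2, atol=1e-3)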
2025-05-07T20:32:52.3919755Z 
2025-05-07T20:32:52.3919935Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings
2025-05-07T20:32:52.3921376Z   /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
2025-05-07T20:32:52.3921570Z     torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)
2025-05-07T20:32:52.3921578Z 
2025-05-07T20:32:52.3921793Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2025-05-07T20:32:52.3922031Z ================== 1 failed, 1 passed, 13 warnings in 20.57s ===================
2025-05-07T20:32:54.2490121Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:32:54.3144448Z 
2025-05-07T20:32:54.3144861Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py
2025-05-07T20:32:54.3145232Z 
2025-05-07T20:32:54.3145236Z 
2025-05-07T20:32:54.3168165Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
2025-05-07T20:32:56.4880275Z ============================= test session starts ==============================
2025-05-07T20:32:56.4880908Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:56.4881454Z cachedir: .pytest_cache
2025-05-07T20:32:56.4882038Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:56.4882783Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:56.4883202Z plugins: hypothesis-6.131.14
2025-05-07T20:32:58.1262337Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:32:58.2359117Z collecting ... collected 2 items / 1 deselected / 1 selected
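
[Editor's sketch] The FutureWarning in the summary above (activation_test.py:72) has an equally mechanical fix: torch.testing.assert_close replaces the deprecated assert_allclose, but it is stricter by default. A sketch of a drop-in helper, the name is ours, that keeps the test's tolerances while preserving the older, looser semantics:

# Sketch of the migration the FutureWarning asks for. assert_close also
# checks dtype and device by default, which assert_allclose did not, so
# disable those checks to keep the old behavior.
import torch


def assert_allclose_compat(
    actual: torch.Tensor,
    expected: torch.Tensor,
    rtol: float = 1.6e-2,
    atol: float = 1e-3,
) -> None:
    torch.testing.assert_close(
        actual,
        expected,
        rtol=rtol,
        atol=atol,
        check_dtype=False,
        check_device=False,
    )
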
2025-05-07T20:32:58.2359711Z run-last-failure: rerun previous 1 failure
2025-05-07T20:32:58.2360012Z 
2025-05-07T20:33:00.3994938Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:00.3996087Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last):
2025-05-07T20:33:00.3997514Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:33:00.3999037Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:33:00.4001655Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:00.4003037Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:33:00.4004484Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:00.4005516Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:00.4006809Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:33:00.4008603Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:00.4009724Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:00.4011211Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:33:00.4012527Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     generator.visit(fn.parse())
2025-05-07T20:33:00.4013813Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:33:00.4015245Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ret = super().visit(node)
2025-05-07T20:33:00.4016116Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:00.4017192Z W0507 
20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:00.4018257Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:33:00.4019086Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:33:00.4020369Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:00.4021710Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:00.4022885Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:00.4023999Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:33:00.4031926Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:00.4033397Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:00.4034513Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4035483Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4036265Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:33:00.4037339Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4157868Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:00.4159231Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:33:00.4160646Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:33:00.4162268Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:33:00.4163292Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:00.4164673Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:33:00.4166125Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4167162Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:00.4168460Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:00.4169909Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4171033Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:00.4172379Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:33:00.4173705Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:33:00.4175133Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:00.4176428Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:33:00.4177326Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:00.4178392Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:00.4179466Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:33:00.4180299Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:33:00.4181574Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:00.4183013Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:00.4184195Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:00.4185378Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:33:00.4186624Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:00.4188062Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:00.4189185Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4190138Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4190920Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:33:00.4191992Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.8359927Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.8360646Z self=, 2025-05-07T20:33:00.8361076Z T=1, 2025-05-07T20:33:00.8361260Z D=5120, 2025-05-07T20:33:00.8361457Z scale_ub=None, 2025-05-07T20:33:00.8361711Z contiguous=True, 2025-05-07T20:33:00.8361930Z compiled=True, 2025-05-07T20:33:00.8362141Z ) 2025-05-07T20:33:00.8362469Z self = 2025-05-07T20:33:00.8362965Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.8363254Z 2025-05-07T20:33:00.8363335Z @given( 2025-05-07T20:33:00.8363569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.8363888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.8364208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.8364553Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.8364893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.8365187Z ) 2025-05-07T20:33:00.8365549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.8366010Z def test_silu_mul_quant( 2025-05-07T20:33:00.8366257Z self, 2025-05-07T20:33:00.8366465Z T: int, 2025-05-07T20:33:00.8366668Z D: int, 2025-05-07T20:33:00.8366891Z scale_ub: Optional[float], 2025-05-07T20:33:00.8367178Z contiguous: bool, 2025-05-07T20:33:00.8367425Z compiled: bool, 2025-05-07T20:33:00.8367658Z ) -> None: 2025-05-07T20:33:00.8367884Z torch.manual_seed(2025) 2025-05-07T20:33:00.8368131Z 2025-05-07T20:33:00.8368408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.8368780Z 2025-05-07T20:33:00.8368988Z x_sign = torch.sign(x) 2025-05-07T20:33:00.8369296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.8369614Z x = x_sign * x_clamp 2025-05-07T20:33:00.8369863Z x0 = x[:, :D] 2025-05-07T20:33:00.8370089Z x1 = x[:, D:] 2025-05-07T20:33:00.8370295Z 2025-05-07T20:33:00.8370486Z if contiguous: 2025-05-07T20:33:00.8370725Z x0 = x0.contiguous() 2025-05-07T20:33:00.8371344Z x1 = x1.contiguous() 2025-05-07T20:33:00.8371599Z 2025-05-07T20:33:00.8371800Z if scale_ub is not None: 2025-05-07T20:33:00.8372076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.8372423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.8372890Z ) 2025-05-07T20:33:00.8373083Z else: 2025-05-07T20:33:00.8373299Z scale_ub_tensor = None 2025-05-07T20:33:00.8373566Z 2025-05-07T20:33:00.8373794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.8374119Z op = silu_mul_quant 2025-05-07T20:33:00.8374483Z if compiled: 2025-05-07T20:33:00.8374736Z op = torch.compile(op) 2025-05-07T20:33:00.8375037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.8375323Z 2025-05-07T20:33:00.8375517Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.8375799Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.8376110Z 2025-05-07T20:33:00.8376390Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.8376726Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.8377027Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.8377352Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.8377719Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.8378039Z 2025-05-07T20:33:00.8378242Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:00.8378444Z 2025-05-07T20:33:00.8378555Z moe/activation_test.py:126: 2025-05-07T20:33:00.8378853Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:00.8379202Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:00.8379541Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:00.8380373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:00.8381168Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:00.8381742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:00.8382464Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:00.8383190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:00.8383957Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:00.8384732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:00.8385403Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:00.8386041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:00.8386590Z     fn()
2025-05-07T20:33:00.8387131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:00.8387749Z     self.fn.run(
2025-05-07T20:33:00.8388246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:00.8388814Z     kernel = self.compile(
2025-05-07T20:33:00.8389378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:00.8390071Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:00.8390489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:00.8390726Z 
2025-05-07T20:33:00.8390944Z self = 
2025-05-07T20:33:00.8392166Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:00.8393629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99497ecc20>}
2025-05-07T20:33:00.8395131Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:00.8396223Z context = 
2025-05-07T20:33:00.8396525Z 
2025-05-07T20:33:00.8396706Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:00.8397252Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:00.8397746Z                            module_map=module_map)
2025-05-07T20:33:00.8398135Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.8398505Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.8398796Z E       ^
2025-05-07T20:33:00.8399287Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.8399771Z 
2025-05-07T20:33:00.8400220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.8400766Z 
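
[Editor's sketch] Every CompilationError in this rerun has the same root cause: fp8e4nv is Triton's name for the float8_e4m3fn format, which requires an SM 8.9+ GPU (Ada or Hopper). The A10G on this g5 runner reports compute capability 8.6, where only fp8e5 and fp8e4b15 exist, so the kernel is rejected at compile time before it ever runs. A sketch of a capability gate that would skip, rather than fail, these examples on pre-8.9 parts; the marker name is illustrative, not the suite's:

# Sketch: gate FP8 (e4m3) tests on compute capability instead of letting
# Triton fail during AST-to-TTIR compilation.
import pytest
import torch


def _supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # fp8e4nv (float8_e4m3fn) needs SM 8.9+ (Ada / Hopper); the A10G on
    # g5.4xlarge runners reports (8, 6).
    return (major, minor) >= (8, 9)


requires_fp8 = pytest.mark.skipif(
    not _supports_fp8e4nv(),
    reason="fp8e4nv (float8_e4m3fn) requires compute capability >= 8.9",
)
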
2025-05-07T20:33:00.8400874Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.8401310Z     self=,
2025-05-07T20:33:00.8401738Z     T=2048,
2025-05-07T20:33:00.8401934Z     D=5120,
2025-05-07T20:33:00.8402130Z     scale_ub=1200.0,
2025-05-07T20:33:00.8402355Z     contiguous=True,
2025-05-07T20:33:00.8402576Z     compiled=False,
2025-05-07T20:33:00.8402788Z )
2025-05-07T20:33:01.2872141Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:01.2873323Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last):
2025-05-07T20:33:01.2874768Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:33:01.2876307Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:33:01.2877347Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:01.2878728Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:33:01.2880216Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.2881258Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:01.2882561Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:33:01.2884415Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.2885550Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:01.2887119Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:33:01.2888456Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     generator.visit(fn.parse())
2025-05-07T20:33:01.2889751Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:01.2891030Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:33:01.2891892Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.2892985Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:01.2894064Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:33:01.2895019Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:33:01.2896308Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:01.2897657Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:01.2898844Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:01.2899952Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:33:01.2901203Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:01.2902645Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:01.2903760Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2904718Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.2905502Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:33:01.2906577Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.3776168Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:01.3777517Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:33:01.3778927Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:33:01.3780551Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:33:01.3781571Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.3782946Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:33:01.3784394Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.3785443Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.3786733Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:01.3788183Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.3789298Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.3790657Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:33:01.3791985Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:33:01.3793277Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:01.3794550Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:33:01.3795408Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.3796484Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:01.3797559Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:33:01.3798391Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:33:01.3799657Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:01.3801090Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:01.3802262Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:01.3803434Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:33:01.3804675Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:01.3806093Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:01.3807205Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.3808154Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.3808929Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:33:01.3809989Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8416858Z self = 2025-05-07T20:33:01.8417503Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.8417812Z 2025-05-07T20:33:01.8417901Z @given( 2025-05-07T20:33:01.8418142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8418482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8418796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8419135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8419463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8419770Z ) 2025-05-07T20:33:01.8420124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8420579Z def test_silu_mul_quant( 2025-05-07T20:33:01.8420818Z self, 2025-05-07T20:33:01.8421015Z T: int, 2025-05-07T20:33:01.8421215Z D: int, 2025-05-07T20:33:01.8421432Z scale_ub: Optional[float], 2025-05-07T20:33:01.8421707Z contiguous: bool, 2025-05-07T20:33:01.8421947Z compiled: bool, 2025-05-07T20:33:01.8422167Z ) -> None: 2025-05-07T20:33:01.8422382Z torch.manual_seed(2025) 2025-05-07T20:33:01.8422627Z 2025-05-07T20:33:01.8422902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8423255Z 2025-05-07T20:33:01.8423449Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8423735Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8424050Z x = x_sign * x_clamp 2025-05-07T20:33:01.8424299Z x0 = x[:, :D] 2025-05-07T20:33:01.8424508Z x1 = x[:, D:] 2025-05-07T20:33:01.8424716Z 2025-05-07T20:33:01.8424910Z if contiguous: 2025-05-07T20:33:01.8425135Z x0 = x0.contiguous() 2025-05-07T20:33:01.8425567Z x1 = x1.contiguous() 2025-05-07T20:33:01.8425827Z 2025-05-07T20:33:01.8426026Z if scale_ub is not None: 2025-05-07T20:33:01.8426304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.8426693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.8427032Z ) 2025-05-07T20:33:01.8427229Z else: 2025-05-07T20:33:01.8427447Z scale_ub_tensor = None 2025-05-07T20:33:01.8427701Z 2025-05-07T20:33:01.8428156Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.8428487Z op = silu_mul_quant 2025-05-07T20:33:01.8428739Z if compiled: 2025-05-07T20:33:01.8428984Z op = torch.compile(op) 2025-05-07T20:33:01.8429445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8429736Z 2025-05-07T20:33:01.8429928Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.8430101Z 2025-05-07T20:33:01.8430206Z moe/activation_test.py:117: 2025-05-07T20:33:01.8430513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8430859Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.8431147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8431877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.8432606Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.8433177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.8433898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.8434629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.8435194Z kernel = self.compile( 2025-05-07T20:33:01.8435763Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.8436458Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.8436866Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8437110Z 2025-05-07T20:33:01.8437324Z self = 2025-05-07T20:33:01.8438455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.8439898Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99498a8180>} 2025-05-07T20:33:01.8441305Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.8442390Z context = 2025-05-07T20:33:01.8442698Z 2025-05-07T20:33:01.8442870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.8443416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.8443904Z module_map=module_map) 2025-05-07T20:33:01.8444283Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.8444650Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.8444922Z E ^ 2025-05-07T20:33:01.8445399Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8445887Z 2025-05-07T20:33:01.8446327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8446871Z 2025-05-07T20:33:01.8446983Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.8447408Z self=, 2025-05-07T20:33:01.8447832Z T=2048, 2025-05-07T20:33:01.8448028Z D=5120, 2025-05-07T20:33:01.8448235Z scale_ub=1200.0, 2025-05-07T20:33:01.8448460Z contiguous=True, 2025-05-07T20:33:01.8448691Z compiled=True, 2025-05-07T20:33:01.8448904Z ) 2025-05-07T20:33:01.8449315Z self = 2025-05-07T20:33:01.8449833Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.8450118Z 2025-05-07T20:33:01.8450207Z @given( 2025-05-07T20:33:01.8450513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8450841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8451160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8451495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8451840Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8452144Z ) 2025-05-07T20:33:01.8452509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8452967Z def test_silu_mul_quant( 2025-05-07T20:33:01.8453214Z self, 2025-05-07T20:33:01.8453413Z T: int, 2025-05-07T20:33:01.8453610Z D: int, 2025-05-07T20:33:01.8453843Z scale_ub: Optional[float], 2025-05-07T20:33:01.8454123Z contiguous: bool, 2025-05-07T20:33:01.8454470Z compiled: bool, 2025-05-07T20:33:01.8454697Z ) -> None: 2025-05-07T20:33:01.8454911Z torch.manual_seed(2025) 2025-05-07T20:33:01.8455149Z 2025-05-07T20:33:01.8455424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8455777Z 2025-05-07T20:33:01.8455963Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8456253Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8456566Z x = x_sign * x_clamp 2025-05-07T20:33:01.8456800Z x0 = x[:, :D] 
2025-05-07T20:33:01.8457013Z         x1 = x[:, D:]
2025-05-07T20:33:01.8457222Z 
2025-05-07T20:33:01.8457405Z         if contiguous:
2025-05-07T20:33:01.8457632Z             x0 = x0.contiguous()
2025-05-07T20:33:01.8457891Z             x1 = x1.contiguous()
2025-05-07T20:33:01.8458135Z 
2025-05-07T20:33:01.8458321Z         if scale_ub is not None:
2025-05-07T20:33:01.8458601Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:01.8458940Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:01.8459245Z             )
2025-05-07T20:33:01.8459450Z         else:
2025-05-07T20:33:01.8459676Z             scale_ub_tensor = None
2025-05-07T20:33:01.8459933Z 
2025-05-07T20:33:01.8460167Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.8460485Z             op = silu_mul_quant
2025-05-07T20:33:01.8460732Z             if compiled:
2025-05-07T20:33:01.8460980Z                 op = torch.compile(op)
2025-05-07T20:33:01.8461282Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.8461561Z 
2025-05-07T20:33:01.8461755Z         y_fp8, y_scale = fn()
2025-05-07T20:33:01.8462049Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:01.8462348Z 
2025-05-07T20:33:01.8462577Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.8462925Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:01.8463224Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:01.8463538Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:01.8463907Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:01.8464227Z 
2025-05-07T20:33:01.8464426Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:01.8464633Z 
2025-05-07T20:33:01.8464729Z moe/activation_test.py:126: 
2025-05-07T20:33:01.8465033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.8465383Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:01.8465711Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:01.8466536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:01.8467326Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:01.8467974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.8468695Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.8469420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:01.8470255Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:01.8471020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:01.8471698Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:01.8472332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:01.8472879Z     fn()
2025-05-07T20:33:01.8473413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:01.8474030Z     self.fn.run(
2025-05-07T20:33:01.8474516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.8475066Z     kernel = self.compile(
2025-05-07T20:33:01.8475636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.8476320Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.8476732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.8476992Z 
2025-05-07T20:33:01.8477225Z self = 
2025-05-07T20:33:01.8478350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:01.8479770Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9948439580>}
2025-05-07T20:33:01.8481169Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:01.8488681Z context = 
2025-05-07T20:33:01.8489038Z 
2025-05-07T20:33:01.8489224Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.8489843Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.8490330Z                            module_map=module_map)
2025-05-07T20:33:01.8490703Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.8491082Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:01.8491362Z E       ^
2025-05-07T20:33:01.8491835Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.8492316Z 
2025-05-07T20:33:01.8492765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:01.8493318Z 
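
[Editor's sketch] The test's dequantization above, y = y_fp8.to(torch.float32) * y_scale[:, None], fixes the contract: one scale per row. A pure-PyTorch sketch of what we take triton_quantize_fp8_row to compute under that contract (a simplified assumption on our part; fbgemm's kernel also applies an epsilon floor and other details we omit), usable as a reference on GPUs where the Triton FP8 kernels cannot compile:

# Requires a PyTorch build with torch.float8_e4m3fn (2.1+). The cast itself
# is an elementwise conversion; only fused FP8 kernels need SM 8.9+.
from typing import Optional, Tuple

import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally clamped to the scale upper bound.
    row_max = y.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    row_max = torch.clamp(row_max, min=1e-12)  # avoid divide-by-zero rows
    inv_scale = FP8_MAX / row_max  # multiply by this to quantize
    y_fp8 = (y.float() * inv_scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    # y_scale = row_max / FP8_MAX, so y_fp8.float() * y_scale[:, None] ~ y.
    return y_fp8.to(torch.float8_e4m3fn), row_max / FP8_MAX
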
2025-05-07T20:33:01.8493423Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:01.8493855Z     self=,
2025-05-07T20:33:01.8494271Z     T=16384,
2025-05-07T20:33:01.8494534Z     D=7168,
2025-05-07T20:33:01.8494735Z     scale_ub=1200.0,
2025-05-07T20:33:01.8494957Z     contiguous=False,
2025-05-07T20:33:01.8495189Z     compiled=False,
2025-05-07T20:33:01.8495403Z )
2025-05-07T20:33:02.0962912Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:02.0965148Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last):
2025-05-07T20:33:02.0967430Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:33:02.0969056Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:33:02.0970090Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:02.0971478Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:33:02.0972935Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:02.0973978Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:02.0975359Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:02.0976827Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.0977952Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:02.0979310Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:33:02.0980636Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:33:02.0981930Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:02.0983220Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:33:02.0984093Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:02.0985173Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:02.0986262Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:33:02.0987092Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:33:02.0988371Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:02.0989797Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:02.0990982Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:02.0992165Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:33:02.0993414Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:02.0994855Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:02.0995969Z 
W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:02.0996979Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:33:02.0997767Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:33:02.0998848Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:02.1597087Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:02.1598207Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last):
2025-05-07T20:33:02.1599604Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:33:02.1601106Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:33:02.1602132Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:02.1603520Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:33:02.1604983Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:02.1606014Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:02.1607365Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:33:02.1608826Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:02.1610113Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:02.1611472Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:33:02.1612891Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     generator.visit(fn.parse())
2025-05-07T20:33:02.1614182Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:33:02.1615554Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ret = super().visit(node)
2025-05-07T20:33:02.1617512Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:33:02.1618584Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     return visitor(node)
2025-05-07T20:33:02.1620687Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:33:02.1622048Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:33:02.1623217Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:33:02.1624319Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     self.visit(item)
2025-05-07T20:33:02.1625787Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:33:02.1627224Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:33:02.1628344Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:02.1629295Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:33:02.1630078Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:33:02.1631159Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:02.6724677Z self = 
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9948439c60>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
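Every failure in this job reduces to the same root cause: Triton's fp8e4nv is the NVIDIA FP8 E4M3 format, whose conversions are only available on GPUs with compute capability 8.9 or newer (Ada, Hopper). This job runs on a g5 instance, i.e. an A10G at sm_86, so any kernel that casts to fp8e4nv fails at compile time. A minimal sketch of a capability gate follows; the helper and marker names are illustrative, not FBGEMM's actual test utilities:

    # Sketch only: skip FP8 tests on GPUs older than sm_89, where Triton's
    # fp8e4nv (FP8 E4M3) type is unavailable. Names here are hypothetical.
    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)  # Ada (sm_89) and Hopper (sm_90+)

    requires_fp8 = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )

Decorating test_silu_mul_quant with such a marker would turn these hard failures into skips on pre-Ada runners like this one.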
2025-05-07T20:33:02.6755769Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True,)

self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    ...
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994843ad40>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
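Both failing paths quantize row-wise to FP8: the fused _fbgemm_silu_mul_quant kernel writes FP8 directly, and ref_fn quantizes silu(x0) * x1 through triton_quantize_fp8_row. A minimal eager sketch of that row-wise scheme, assuming the semantics implied by the test's dequantization check y ~= y_fp8.float() * y_scale[:, None]; the scale_ub handling below is my reading, not FBGEMM's implementation:

    # Eager sketch of row-wise FP8 quantization; not FBGEMM's code.
    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            # Assumption: scale_ub caps the per-row max before scaling.
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)  # avoid divide-by-zero
        y_scale = row_max / FP8_MAX                # per-row dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

With this convention, y_fp8.to(torch.float32) * y_scale[:, None] recovers y up to FP8 rounding, which is exactly what the test compares.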
2025-05-07T20:33:02.6797585Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False,)
2025-05-07T20:33:03.1575278Z W0507 20:33:03.154000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:03.3769908Z W0507 20:33:03.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
self = 
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    ...
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
...
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    ...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:03.9821726Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False,)

self = 
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False

    ...
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:03.9854247Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True,)

self = 
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    ...
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
...
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    ...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
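The ValueError surfaces while Triton lowers the kernel's Python AST to TTIR, as soon as the code generator meets a conversion to tl.float8e4nv. A standalone repro sketch, independent of FBGEMM; the kernel and variable names are illustrative only:

    # Minimal repro sketch: on sm_86 this launch fails at compile time with
    # the same CompilationError/ValueError seen above; on sm_89+ it runs.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(X, Y, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(X + offs, mask=mask)
        # The cast below is what triggers "type fp8e4nv not supported".
        tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=128)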
2025-05-07T20:33:04.0461526Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False,)

self = 
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    ...
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
...
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    ...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:04.2482011Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False,)

self = 
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    ...
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
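Hypothesis prints each "Trying example" line because the test runs at Verbosity.verbose, and every drawn parameter set fails identically. Once a failing set is triaged, it can be pinned with hypothesis's @example decorator so future runs always exercise it. A sketch, with an assumed _MAX_SAMPLES value and the test body elided:

    # Sketch: pin one failing parameter set from this log with @example.
    # _MAX_SAMPLES is assumed; the real constant lives in the test module.
    from hypothesis import Verbosity, example, given, settings
    from hypothesis import strategies as st

    _MAX_SAMPLES = 100

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant_pinned(T, D, scale_ub, contiguous, compiled) -> None:
        pass  # body as in the original test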
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:04.4903219Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:33:04.4904097Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:04.4905177Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:04.4906251Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:33:04.4907095Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:33:04.4908381Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:04.4909739Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:04.4910930Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:04.4912028Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:33:04.4913287Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:04.4914729Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:04.4915847Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.4916890Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.4917666Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:33:04.4918871Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8574454Z self = 2025-05-07T20:33:04.8574961Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:04.8575266Z 2025-05-07T20:33:04.8575349Z @given( 2025-05-07T20:33:04.8575594Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8575917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8576230Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8576569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8576907Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8577201Z ) 2025-05-07T20:33:04.8577559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8578019Z def test_silu_mul_quant( 2025-05-07T20:33:04.8578260Z self, 2025-05-07T20:33:04.8578457Z T: int, 2025-05-07T20:33:04.8578650Z D: int, 2025-05-07T20:33:04.8578869Z scale_ub: Optional[float], 2025-05-07T20:33:04.8579138Z contiguous: bool, 2025-05-07T20:33:04.8579379Z compiled: bool, 2025-05-07T20:33:04.8579606Z ) -> None: 2025-05-07T20:33:04.8579820Z torch.manual_seed(2025) 2025-05-07T20:33:04.8580067Z 2025-05-07T20:33:04.8580349Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8580704Z 2025-05-07T20:33:04.8580906Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8581201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8581517Z x = x_sign * x_clamp 2025-05-07T20:33:04.8581915Z x0 = x[:, :D] 2025-05-07T20:33:04.8582142Z x1 = x[:, D:] 2025-05-07T20:33:04.8582347Z 2025-05-07T20:33:04.8582543Z if contiguous: 2025-05-07T20:33:04.8582778Z x0 = x0.contiguous() 2025-05-07T20:33:04.8583035Z x1 = x1.contiguous() 2025-05-07T20:33:04.8583396Z 2025-05-07T20:33:04.8583595Z if scale_ub is not None: 2025-05-07T20:33:04.8583878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8584223Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8584545Z ) 2025-05-07T20:33:04.8584745Z else: 2025-05-07T20:33:04.8584956Z scale_ub_tensor = None 2025-05-07T20:33:04.8585216Z 2025-05-07T20:33:04.8585453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8585769Z op = silu_mul_quant 2025-05-07T20:33:04.8586021Z if compiled: 2025-05-07T20:33:04.8586273Z op = torch.compile(op) 2025-05-07T20:33:04.8586580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8586862Z 2025-05-07T20:33:04.8587056Z y_fp8, y_scale = fn() 2025-05-07T20:33:04.8587343Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:04.8587646Z 2025-05-07T20:33:04.8587889Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8588228Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:04.8588536Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:04.8588861Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:04.8589234Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.8589547Z 2025-05-07T20:33:04.8589748Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:04.8589949Z 2025-05-07T20:33:04.8590050Z moe/activation_test.py:126: 2025-05-07T20:33:04.8590349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8590698Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:04.8591037Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.8591860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:33:04.8592652Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:04.8593224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:04.8593945Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8594662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:04.8595434Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:04.8596210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:04.8596892Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:04.8597528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:04.8598087Z fn() 2025-05-07T20:33:04.8598635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:04.8599262Z self.fn.run( 2025-05-07T20:33:04.8599753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8600316Z kernel = self.compile( 2025-05-07T20:33:04.8600887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8601578Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8601993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8602230Z 2025-05-07T20:33:04.8602540Z self = 2025-05-07T20:33:04.8603675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8605172Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99484c7d80>} 2025-05-07T20:33:04.8606581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8607674Z context = 2025-05-07T20:33:04.8607976Z 2025-05-07T20:33:04.8608164Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8608708Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8609199Z module_map=module_map) 2025-05-07T20:33:04.8609575Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8609959Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:04.8610235Z E ^ 2025-05-07T20:33:04.8610716Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8611192Z 2025-05-07T20:33:04.8611637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8612181Z
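Every failure in this job reduces to the same root cause: this Triton build lowers fp8e4nv (the dtype behind torch.float8_e4m3fn) only on SM 8.9+ GPUs (Ada/Hopper), while the A10G on a linux.g5.4xlarge runner reports compute capability 8.6, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability guard that would skip these examples on such runners; the helper and class names below are illustrative, not the suite's actual code:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (e4m3) Triton kernels need an SM 8.9+ GPU in
    # this build; the A10G driving this job reports capability (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class ActivationTests(unittest.TestCase):  # hypothetical class name
    @unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    def test_silu_mul_quant(self) -> None:
        ...

With a guard like this, hypothesis would report the examples below as skipped instead of replaying the same CompilationError for each one.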
2025-05-07T20:33:04.8612296Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8612725Z self=, 2025-05-07T20:33:04.8613150Z T=2048, 2025-05-07T20:33:04.8613359Z D=5120, 2025-05-07T20:33:04.8613555Z scale_ub=None, 2025-05-07T20:33:04.8613778Z contiguous=True, 2025-05-07T20:33:04.8614002Z compiled=True, 2025-05-07T20:33:04.8614204Z ) 2025-05-07T20:33:05.0870189Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:05.0901372Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0902319Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0903104Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:33:05.0904295Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.4548282Z self = 2025-05-07T20:33:05.4548850Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.4563358Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:05.4563663Z moe/activation_test.py:126: 2025-05-07T20:33:05.4563965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.4564311Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:05.4564636Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.4565457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:05.4566243Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.4583149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.4583520Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.4583797Z E ^ 2025-05-07T20:33:05.4584281Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.4584761Z 2025-05-07T20:33:05.4585198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.4585749Z
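For orientation while reading these tracebacks: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so triton_quantize_fp8_row evidently returns a rowwise fp8 tensor plus a per-row dequantization scale. A pure-PyTorch sketch of that contract, inferred only from the usage above; the function name, epsilon, and clamping details are assumptions, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch


def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # The per-row absolute maximum decides how far each row must be
    # scaled to fit the fp8 e4m3 range; scale_ub optionally caps it.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_amax = x.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_amax = torch.minimum(row_amax, scale_ub)
    y_scale = row_amax.clamp(min=1e-12) / fp8_max  # dequantization scale
    y_fp8 = (x.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

A stand-in like this only needs fp8 storage at the PyTorch level, which works on SM 8.6; it is the Triton kernel compilation that fails in this log.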
2025-05-07T20:33:05.4585854Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.4586287Z self=, 2025-05-07T20:33:05.4586701Z T=128, 2025-05-07T20:33:05.4586900Z D=5120, 2025-05-07T20:33:05.4587086Z scale_ub=None, 2025-05-07T20:33:05.4587297Z contiguous=True, 2025-05-07T20:33:05.4587555Z compiled=True, 2025-05-07T20:33:05.4587767Z ) 2025-05-07T20:33:05.7014920Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:05.7045995Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.7046949Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.7047728Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:33:05.7048842Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.1169375Z self = 2025-05-07T20:33:06.1169903Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:06.1184930Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:06.1185239Z moe/activation_test.py:126: 2025-05-07T20:33:06.1185555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.1185899Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:06.1186235Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.1187058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:06.1187852Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:06.1204696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.1205069Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:06.1205348Z E ^ 2025-05-07T20:33:06.1205838Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.1206319Z 2025-05-07T20:33:06.1206758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.1207302Z
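The architecture gap can be confirmed independently of the test suite; a minimal sketch that should hit the same compile-time ValueError on an SM 8.6 part (the kernel name, block size, and rtne rounding argument are illustrative choices, not taken from FBGEMM):

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # This cast is what trips "type fp8e4nv not supported in this
    # architecture" at compile time on pre-SM-8.9 GPUs.
    y = x.to(tl.float8e4nv, fp_downcast_rounding="rtne")
    tl.store(y_ptr + offs, y, mask=mask)


x = torch.randn(1024, device="cuda")
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
_cast_fp8e4nv_kernel[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)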
2025-05-07T20:33:06.1207411Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.1207891Z self=, 2025-05-07T20:33:06.1208319Z T=4096, 2025-05-07T20:33:06.1208507Z D=5120, 2025-05-07T20:33:06.1208708Z scale_ub=None, 2025-05-07T20:33:06.1208934Z contiguous=True, 2025-05-07T20:33:06.1209174Z compiled=True, 2025-05-07T20:33:06.1209383Z ) 2025-05-07T20:33:06.3640623Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:06.3675140Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.3676094Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.3676872Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:33:06.3677949Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.4340657Z W0507 20:33:06.431000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:06.7764629Z self = 
2025-05-07T20:33:06.7765159Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:06.7765459Z 
2025-05-07T20:33:06.7765583Z     @given(
2025-05-07T20:33:06.7765828Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:06.7766149Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:06.7766463Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:06.7766806Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:06.7767141Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:06.7767439Z     )
2025-05-07T20:33:06.7767826Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:06.7768304Z     def test_silu_mul_quant(
2025-05-07T20:33:06.7768553Z         self,
2025-05-07T20:33:06.7768754Z         T: int,
2025-05-07T20:33:06.7768943Z         D: int,
2025-05-07T20:33:06.7769173Z         scale_ub: Optional[float],
2025-05-07T20:33:06.7769451Z         contiguous: bool,
2025-05-07T20:33:06.7769699Z         compiled: bool,
2025-05-07T20:33:06.7769942Z     ) -> None:
2025-05-07T20:33:06.7770157Z         torch.manual_seed(2025)
2025-05-07T20:33:06.7770414Z 
2025-05-07T20:33:06.7770697Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:06.7771048Z 
2025-05-07T20:33:06.7771243Z         x_sign = torch.sign(x)
2025-05-07T20:33:06.7771539Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:06.7771856Z         x = x_sign * x_clamp
2025-05-07T20:33:06.7772100Z         x0 = x[:, :D]
2025-05-07T20:33:06.7772315Z         x1 = x[:, D:]
2025-05-07T20:33:06.7772521Z 
2025-05-07T20:33:06.7772708Z         if contiguous:
2025-05-07T20:33:06.7772940Z             x0 = x0.contiguous()
2025-05-07T20:33:06.7773199Z             x1 = x1.contiguous()
2025-05-07T20:33:06.7773442Z 
2025-05-07T20:33:06.7773640Z         if scale_ub is not None:
2025-05-07T20:33:06.7773911Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:06.7774249Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:06.7774678Z             )
2025-05-07T20:33:06.7774876Z         else:
2025-05-07T20:33:06.7775090Z             scale_ub_tensor = None
2025-05-07T20:33:06.7775349Z 
2025-05-07T20:33:06.7775583Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:06.7775899Z             op = silu_mul_quant
2025-05-07T20:33:06.7776153Z             if compiled:
2025-05-07T20:33:06.7776404Z                 op = torch.compile(op)
2025-05-07T20:33:06.7776704Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:06.7776984Z 
2025-05-07T20:33:06.7777181Z         y_fp8, y_scale = fn()
2025-05-07T20:33:06.7777467Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:06.7777768Z 
2025-05-07T20:33:06.7778005Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:06.7778501Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:06.7778814Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:06.7779146Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:06.7779638Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:06.7779956Z 
2025-05-07T20:33:06.7780160Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:06.7780359Z 
2025-05-07T20:33:06.7780464Z moe/activation_test.py:126: 
2025-05-07T20:33:06.7780766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:06.7781114Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:06.7781453Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:06.7782274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:06.7783080Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:06.7783655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.7784376Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.7785100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:06.7785865Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:06.7786643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:06.7787316Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:06.7788175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:06.7788732Z     fn()
2025-05-07T20:33:06.7789274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:06.7789887Z     self.fn.run(
2025-05-07T20:33:06.7790387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.7790956Z     kernel = self.compile(
2025-05-07T20:33:06.7791523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.7792207Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.7792628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:06.7792871Z 
2025-05-07T20:33:06.7793089Z self = 
2025-05-07T20:33:06.7794225Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.7795645Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9942710ae0>}
2025-05-07T20:33:06.7797060Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:06.7798151Z context = 
2025-05-07T20:33:06.7798452Z 
2025-05-07T20:33:06.7798629Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.7799170Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.7799660Z             module_map=module_map)
2025-05-07T20:33:06.7800035Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.7800494Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:06.7800766Z E       ^
2025-05-07T20:33:06.7801242Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.7801796Z 
2025-05-07T20:33:06.7802244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:06.7802790Z 
2025-05-07T20:33:06.7802898Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, )
2025-05-07T20:33:06.8069059Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:06.8070369Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:06.8071781Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:06.8072821Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:06.8073986Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:06.8958378Z self = 
2025-05-07T20:33:06.8958957Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:06.8973622Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:06.8973939Z moe/activation_test.py:126: 
2025-05-07T20:33:06.8998963Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.8999333Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:06.8999603Z E       ^
2025-05-07T20:33:06.9000084Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.9001001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
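The recompile_limit warning above is a side effect of how the test sweeps its parameters rather than a separate bug: x0 = x[:, :D] is a view whose row stride stays 2*D, while x0.contiguous() is a fresh copy with row stride D, so toggling the contiguous flag flips the stride that torch.compile guards on ("expected 5120, actual 10240" is exactly D vs. 2*D for D=5120). After eight such recompiles Dynamo stops compiling silu_mul_quant for this frame. A standalone sketch of the layout difference (not FBGEMM code):

    import torch

    T, D = 128, 5120
    x = torch.randn(T, 2 * D, dtype=torch.bfloat16)
    x0 = x[:, :D]                     # view into x: row stride remains 2*D
    print(x0.stride())                # (10240, 1)
    print(x0.contiguous().stride())   # (5120, 1): densely packed copy
    # torch.compile guards compiled graphs on input strides, so alternating
    # these two layouts forces a recompile each time, until
    # torch._dynamo.config.recompile_limit (8 here) is reached.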
2025-05-07T20:33:06.9001673Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True, )
2025-05-07T20:33:07.0405003Z self = 
2025-05-07T20:33:07.0406083Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:07.0419063Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.0419325Z moe/activation_test.py:117: 
2025-05-07T20:33:07.0434662Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.0435019Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.0435285Z E       ^
2025-05-07T20:33:07.0435755Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.0442156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
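Because the failure is a hardware capability mismatch rather than a numerical bug, a suite like this would normally be skipped on unsupported devices instead of erroring example by example. A hedged sketch of such a guard (decorator placement and names are illustrative, not the actual FBGEMM test code):

    import unittest
    import torch

    def has_fp8e4nv() -> bool:
        # fp8e4nv requires an NVIDIA GPU with compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(not has_fp8e4nv(), "fp8e4nv unsupported on this GPU")
        def test_silu_mul_quant(self) -> None:
            ...  # body as dumped in the log above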
2025-05-07T20:33:07.0442824Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True, )
2025-05-07T20:33:07.1058836Z self = 
2025-05-07T20:33:07.1059882Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:07.1078084Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:07.1078526Z moe/activation_test.py:126: 
2025-05-07T20:33:07.1098103Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.1098464Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:07.1098732Z E       ^
2025-05-07T20:33:07.1099209Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.1100122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
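For orientation: the ref_fn path above computes the SiLU gating in fp32 (x0 * sigmoid(x0) * x1) and then fails inside triton_quantize_fp8_row, i.e. while quantizing that product row-wise to fp8. An eager-mode sketch of the usual rowwise scheme (an assumption about the formulation; the actual FBGEMM kernel may differ in eps handling and in how scale_ub is applied):

    import torch

    def quantize_fp8_row_eager(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        fp8_max = torch.finfo(torch.float8_e4m3fn).max        # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)        # per-row amax
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)        # cap the dynamic range
        scale = row_max / fp8_max                             # dequantization scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Round trip matches the test's dequant step:
    # y ~= y_fp8.to(torch.float32) * scale[:, None]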
2025-05-07T20:33:07.1100860Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False, )
2025-05-07T20:33:07.2613895Z self = 
2025-05-07T20:33:07.2614505Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:33:07.2626556Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.2626832Z moe/activation_test.py:117: 
2025-05-07T20:33:07.2641061Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.2641423Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.2641692Z E       ^
2025-05-07T20:33:07.2642166Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.2643080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
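Note the two traceback shapes in this log: the fn() path fails while launching _fbgemm_silu_mul_quant directly, while the ref_fn() path dies one layer deeper, inside triton/runtime/autotuner.py, because an autotuned Triton kernel compiles and benchmarks every candidate config on its first launch (the do_bench frames in the T=4096 failure). A self-contained sketch of that behavior, with an illustrative kernel rather than FBGEMM's:

    import torch
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 128}, num_warps=4),
            triton.Config({"BLOCK": 256}, num_warps=4),
        ],
        key=["n"],
    )
    @triton.jit
    def _double(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty_like(x)
    # The first call compiles and times each config via do_bench; a compile
    # error in any config (e.g. an unsupported fp8 cast) surfaces right here,
    # which is why autotuner frames appear in the tracebacks above.
    _double[lambda meta: (triton.cdiv(1024, meta["BLOCK"]),)](x, y, 1024)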
2025-05-07T20:33:07.2643738Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True, )
2025-05-07T20:33:07.2645934Z self = 
2025-05-07T20:33:07.2646462Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:07.2658680Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.2658945Z moe/activation_test.py:117: 
2025-05-07T20:33:07.2674198Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.2674564Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.2674832Z E       ^
2025-05-07T20:33:07.2675308Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.2676264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.2676917Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False, )
2025-05-07T20:33:07.3829442Z self = 
2025-05-07T20:33:07.3836331Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:33:07.3848565Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.3848838Z moe/activation_test.py:117: 
2025-05-07T20:33:07.3863168Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.3863531Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.3863799Z E       ^
2025-05-07T20:33:07.3864283Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.3865204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.3865853Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False, )
2025-05-07T20:33:07.3868106Z self = 
2025-05-07T20:33:07.3868623Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:07.3880491Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.3880747Z moe/activation_test.py:117: 
2025-05-07T20:33:07.3894955Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.3895326Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.3895592Z E       ^
2025-05-07T20:33:07.3896074Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.3897005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.3897656Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False, )
2025-05-07T20:33:07.5646650Z self = 
2025-05-07T20:33:07.5647195Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:33:07.5659488Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.5659761Z moe/activation_test.py:117: 
2025-05-07T20:33:07.5673876Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.5674243Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.5674508Z E       ^
2025-05-07T20:33:07.5674982Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:07.5675897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
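Every failure in this run has the same root cause: the kernel asks Triton for the fp8e4nv element type (PyTorch's torch.float8_e4m3fn), and Triton only provides fp8e4nv on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper). On older parts it offers only the fp8e4b15 and fp8e5 encodings named in the ValueError, which is exactly what this runner's GPU reports. A minimal sketch of a guard that would skip these cases on unsupported hardware; the helper and decorator names are illustrative, not taken from the FBGEMM sources:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton maps torch.float8_e4m3fn to fp8e4nv only on SM 8.9+ parts;
        # anything older raises the CompilationError seen throughout this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests that require fp8e4nv support:
    requires_fp8e4nv = unittest.skipIf(
        not gpu_supports_fp8e4nv(), "fp8e4nv (float8_e4m3fn) requires SM 8.9+"
    )

Applied as @requires_fp8e4nv on test_silu_mul_quant, a guard like this would turn the repeated CompilationErrors below into a single skip.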
Hypothesis then tried further examples. The next two failed with the identical traceback, raised through torch/_dynamo/eval_frame.py because compiled=True:

2025-05-07T20:33:07.5676546Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:07.5709143Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

The third sample, with scale_ub=None, got past fn(); the same ValueError then surfaced from the reference path instead, via triton_quantize_fp8_row:

2025-05-07T20:33:07.7175206Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
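This sample shows the problem is not specific to the fused kernel: the plain rowwise quantizer used by the reference trips the same architecture check. For orientation, the test's dequantization step (y = y_fp8.to(torch.float32) * y_scale[:, None]) implies a per-row scale of max|row| / FP8_MAX. Here is a hedged pure-PyTorch sketch of that scheme (the function name and the scale_ub clamping semantics are assumptions, not FBGEMM's implementation), which needs no Triton codegen:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so that y ~= y_fp8.float() * scale[:, None],
        # matching the dequantization done in the test above.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max magnitude.
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The cast to torch.float8_e4m3fn is an ordinary PyTorch conversion, so a reference along these lines runs even on pre-SM-8.9 GPUs where the Triton kernels above cannot compile.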
The run continued through further samples, each failing in fn() with the same CompilationError from _fbgemm_silu_mul_quant; eager samples raise it straight from silu_mul_quant, compiled ones through torch/_dynamo/eval_frame.py. A standalone repro sketch follows this list:

2025-05-07T20:33:08.0083729Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:08.1664679Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:08.1697281Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:08.2624620Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
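Because every sample fails the same way, one of them is enough to reproduce the error without Hypothesis. A minimal repro sketch, assuming silu_mul_quant is importable from the module path shown in the traceback:

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120  # any of the sampled combinations triggers it
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # Raises triton.compiler.errors.CompilationError on pre-SM-8.9 GPUs:
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)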
The remaining samples behaved the same way, all failing in fn():

2025-05-07T20:33:08.3814952Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:08.3847401Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:08.3878747Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:08.5703772Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)

The last of these ended, like every fn() failure above, in:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5741923Z 2025-05-07T20:33:08.5742475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5743031Z 2025-05-07T20:33:08.7229713Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.7230161Z self=, 2025-05-07T20:33:08.7230754Z T=4096, 2025-05-07T20:33:08.7230946Z D=5120, 2025-05-07T20:33:08.7231144Z scale_ub=1200.0, 2025-05-07T20:33:08.7231369Z contiguous=False, 2025-05-07T20:33:08.7231598Z compiled=False, 2025-05-07T20:33:08.7231810Z ) 2025-05-07T20:33:08.7232130Z self = 2025-05-07T20:33:08.7232652Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.7232939Z 2025-05-07T20:33:08.7233026Z @given( 2025-05-07T20:33:08.7233251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.7233569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.7233896Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.7234239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.7234574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.7234867Z ) 2025-05-07T20:33:08.7235229Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.7235677Z def test_silu_mul_quant( 2025-05-07T20:33:08.7235913Z self, 2025-05-07T20:33:08.7236105Z T: int, 2025-05-07T20:33:08.7236298Z D: int, 2025-05-07T20:33:08.7236512Z scale_ub: Optional[float], 2025-05-07T20:33:08.7236779Z contiguous: bool, 2025-05-07T20:33:08.7237018Z compiled: bool, 2025-05-07T20:33:08.7237248Z ) -> None: 2025-05-07T20:33:08.7237470Z torch.manual_seed(2025) 2025-05-07T20:33:08.7237709Z 2025-05-07T20:33:08.7237987Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.7238345Z 2025-05-07T20:33:08.7238544Z x_sign = torch.sign(x) 2025-05-07T20:33:08.7238852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.7239167Z x = x_sign * x_clamp 2025-05-07T20:33:08.7239409Z x0 = x[:, :D] 2025-05-07T20:33:08.7239624Z x1 = x[:, D:] 2025-05-07T20:33:08.7239832Z 2025-05-07T20:33:08.7240020Z if contiguous: 2025-05-07T20:33:08.7240245Z x0 = x0.contiguous() 2025-05-07T20:33:08.7240507Z x1 = x1.contiguous() 2025-05-07T20:33:08.7240748Z 2025-05-07T20:33:08.7240932Z if scale_ub is not None: 2025-05-07T20:33:08.7241201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.7241541Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.7241860Z ) 2025-05-07T20:33:08.7242061Z else: 2025-05-07T20:33:08.7242273Z scale_ub_tensor = None 2025-05-07T20:33:08.7242521Z 2025-05-07T20:33:08.7242758Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.7243087Z op = silu_mul_quant 2025-05-07T20:33:08.7243342Z if compiled: 2025-05-07T20:33:08.7243587Z op = torch.compile(op) 2025-05-07T20:33:08.7243891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7244173Z 2025-05-07T20:33:08.7244361Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.7244536Z 2025-05-07T20:33:08.7244634Z moe/activation_test.py:117: 2025-05-07T20:33:08.7244938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7245275Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.7245561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7246355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.7247076Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.7247755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.7248637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.7249351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.7249975Z kernel = self.compile( 2025-05-07T20:33:08.7250547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.7251244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.7251663Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7251910Z 2025-05-07T20:33:08.7252122Z self = 2025-05-07T20:33:08.7253248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.7254760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f985783a2a0>} 2025-05-07T20:33:08.7256176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.7257267Z context = 2025-05-07T20:33:08.7257571Z 2025-05-07T20:33:08.7257742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.7258294Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.7258786Z module_map=module_map) 2025-05-07T20:33:08.7259156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.7259525Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.7259792Z E ^ 2025-05-07T20:33:08.7260268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.7260752Z 2025-05-07T20:33:08.7261189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.7261738Z 2025-05-07T20:33:08.7261844Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.7262272Z self=, 2025-05-07T20:33:08.7262684Z T=4096, 2025-05-07T20:33:08.7262882Z D=5120, 2025-05-07T20:33:08.7263078Z scale_ub=1200.0, 2025-05-07T20:33:08.7263299Z contiguous=False, 2025-05-07T20:33:08.7263527Z compiled=True, 2025-05-07T20:33:08.7263733Z ) 2025-05-07T20:33:08.7264057Z self = 2025-05-07T20:33:08.7264578Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.7264870Z 2025-05-07T20:33:08.7264947Z @given( 2025-05-07T20:33:08.7265181Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.7265505Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.7265822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.7266161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.7266492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.7266787Z ) 2025-05-07T20:33:08.7267146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.7267658Z def test_silu_mul_quant( 2025-05-07T20:33:08.7267904Z self, 2025-05-07T20:33:08.7268105Z T: int, 2025-05-07T20:33:08.7268333Z D: int, 2025-05-07T20:33:08.7268575Z scale_ub: Optional[float], 2025-05-07T20:33:08.7268841Z contiguous: bool, 2025-05-07T20:33:08.7269154Z compiled: bool, 2025-05-07T20:33:08.7269377Z ) -> None: 2025-05-07T20:33:08.7269584Z torch.manual_seed(2025) 2025-05-07T20:33:08.7269827Z 2025-05-07T20:33:08.7270100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.7270486Z 2025-05-07T20:33:08.7270672Z x_sign = torch.sign(x) 2025-05-07T20:33:08.7270958Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.7271267Z x = x_sign * x_clamp 2025-05-07T20:33:08.7271503Z x0 = x[:, :D] 2025-05-07T20:33:08.7271718Z x1 = x[:, D:] 2025-05-07T20:33:08.7271914Z 2025-05-07T20:33:08.7272100Z if contiguous: 2025-05-07T20:33:08.7272331Z x0 = x0.contiguous() 2025-05-07T20:33:08.7272586Z x1 = x1.contiguous() 2025-05-07T20:33:08.7272829Z 2025-05-07T20:33:08.7273015Z if scale_ub is not None: 2025-05-07T20:33:08.7273284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.7273626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.7273936Z ) 2025-05-07T20:33:08.7274129Z else: 2025-05-07T20:33:08.7274331Z scale_ub_tensor = None 2025-05-07T20:33:08.7274583Z 2025-05-07T20:33:08.7274817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.7275127Z op = silu_mul_quant 2025-05-07T20:33:08.7275372Z if compiled: 2025-05-07T20:33:08.7275616Z op = torch.compile(op) 2025-05-07T20:33:08.7275910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7276200Z 2025-05-07T20:33:08.7276386Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.7276552Z 2025-05-07T20:33:08.7276653Z moe/activation_test.py:117: 2025-05-07T20:33:08.7276945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7277285Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.7277567Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7278144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.7278728Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.7279460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.7280185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.7280738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.7281445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.7282135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.7282686Z kernel = self.compile( 2025-05-07T20:33:08.7283240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.7283926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.7284332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7284566Z 2025-05-07T20:33:08.7284773Z self = 2025-05-07T20:33:08.7285896Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.7287314Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f985783a520>} 2025-05-07T20:33:08.7288922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.7290011Z context = 2025-05-07T20:33:08.7290312Z 2025-05-07T20:33:08.7290482Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.7291065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.7291550Z module_map=module_map) 2025-05-07T20:33:08.7291922Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.7292288Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.7292558Z E ^ 2025-05-07T20:33:08.7293035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.7293513Z 2025-05-07T20:33:08.7293951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.7294589Z 2025-05-07T20:33:08.8458169Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8459055Z self=, 2025-05-07T20:33:08.8459952Z T=2048, 2025-05-07T20:33:08.8460333Z D=7168, 2025-05-07T20:33:08.8460724Z scale_ub=1200.0, 2025-05-07T20:33:08.8461168Z contiguous=False, 2025-05-07T20:33:08.8461621Z compiled=False, 2025-05-07T20:33:08.8462036Z ) 2025-05-07T20:33:08.8462678Z self = 2025-05-07T20:33:08.8463711Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.8464289Z 2025-05-07T20:33:08.8464459Z @given( 2025-05-07T20:33:08.8464912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8465552Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8466179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8466844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8467521Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8468075Z ) 2025-05-07T20:33:08.8468428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8468877Z def test_silu_mul_quant( 2025-05-07T20:33:08.8469120Z self, 2025-05-07T20:33:08.8469311Z T: int, 2025-05-07T20:33:08.8469498Z D: int, 2025-05-07T20:33:08.8469713Z scale_ub: Optional[float], 2025-05-07T20:33:08.8469985Z contiguous: bool, 2025-05-07T20:33:08.8470217Z compiled: bool, 2025-05-07T20:33:08.8470436Z ) -> None: 2025-05-07T20:33:08.8470643Z torch.manual_seed(2025) 2025-05-07T20:33:08.8470884Z 2025-05-07T20:33:08.8471160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8471512Z 2025-05-07T20:33:08.8471697Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8471988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8472308Z x = x_sign * x_clamp 2025-05-07T20:33:08.8472541Z x0 = x[:, :D] 2025-05-07T20:33:08.8472751Z x1 = x[:, D:] 2025-05-07T20:33:08.8472958Z 2025-05-07T20:33:08.8473137Z if contiguous: 2025-05-07T20:33:08.8473368Z x0 = x0.contiguous() 2025-05-07T20:33:08.8473630Z x1 = x1.contiguous() 2025-05-07T20:33:08.8473876Z 2025-05-07T20:33:08.8474061Z if scale_ub is not None: 2025-05-07T20:33:08.8474338Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8474670Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8474970Z ) 2025-05-07T20:33:08.8475267Z else: 2025-05-07T20:33:08.8475473Z scale_ub_tensor = None 2025-05-07T20:33:08.8475720Z 2025-05-07T20:33:08.8475952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8476263Z op = silu_mul_quant 2025-05-07T20:33:08.8476505Z if compiled: 2025-05-07T20:33:08.8476869Z op = torch.compile(op) 2025-05-07T20:33:08.8477172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8477447Z 2025-05-07T20:33:08.8477634Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.8477799Z 2025-05-07T20:33:08.8477957Z moe/activation_test.py:117: 2025-05-07T20:33:08.8478252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8478589Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.8478871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8479591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.8480315Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.8480869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.8481583Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.8482284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.8482833Z kernel = self.compile( 2025-05-07T20:33:08.8483389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.8484080Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.8484482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8484725Z 2025-05-07T20:33:08.8484936Z self = 2025-05-07T20:33:08.8486056Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.8487480Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f985783bec0>} 2025-05-07T20:33:08.8488881Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.8489956Z context = 2025-05-07T20:33:08.8490257Z 2025-05-07T20:33:08.8490423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.8490957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.8491437Z module_map=module_map) 2025-05-07T20:33:08.8491800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.8492154Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.8492417Z E ^ 2025-05-07T20:33:08.8492890Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8493365Z 2025-05-07T20:33:08.8493799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8494463Z 2025-05-07T20:33:08.8494563Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8494979Z self=, 2025-05-07T20:33:08.8495381Z T=1, 2025-05-07T20:33:08.8495562Z D=7168, 2025-05-07T20:33:08.8495752Z scale_ub=None, 2025-05-07T20:33:08.8495957Z contiguous=True, 2025-05-07T20:33:08.8496247Z compiled=False, 2025-05-07T20:33:08.8496448Z ) 2025-05-07T20:33:08.8496761Z self = 2025-05-07T20:33:08.8497258Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.8497528Z 2025-05-07T20:33:08.8497680Z @given( 2025-05-07T20:33:08.8497901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8498241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8498576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8498946Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8499272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8499561Z ) 2025-05-07T20:33:08.8499911Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8500357Z def test_silu_mul_quant( 2025-05-07T20:33:08.8500592Z self, 2025-05-07T20:33:08.8500782Z T: int, 2025-05-07T20:33:08.8500968Z D: int, 2025-05-07T20:33:08.8501182Z scale_ub: Optional[float], 2025-05-07T20:33:08.8501453Z contiguous: bool, 2025-05-07T20:33:08.8501684Z compiled: bool, 2025-05-07T20:33:08.8501904Z ) -> None: 2025-05-07T20:33:08.8502116Z torch.manual_seed(2025) 2025-05-07T20:33:08.8502355Z 2025-05-07T20:33:08.8502628Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8502978Z 2025-05-07T20:33:08.8503169Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8503456Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8503768Z x = x_sign * x_clamp 2025-05-07T20:33:08.8504008Z x0 = x[:, :D] 2025-05-07T20:33:08.8504212Z x1 = x[:, D:] 2025-05-07T20:33:08.8504414Z 2025-05-07T20:33:08.8504597Z if contiguous: 2025-05-07T20:33:08.8504819Z x0 = x0.contiguous() 2025-05-07T20:33:08.8505075Z x1 = x1.contiguous() 2025-05-07T20:33:08.8505316Z 2025-05-07T20:33:08.8505499Z if scale_ub is not None: 2025-05-07T20:33:08.8505770Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8506101Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8506402Z ) 2025-05-07T20:33:08.8506602Z else: 2025-05-07T20:33:08.8506821Z scale_ub_tensor = None 2025-05-07T20:33:08.8507071Z 2025-05-07T20:33:08.8507313Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8507634Z op = silu_mul_quant 2025-05-07T20:33:08.8507886Z if compiled: 2025-05-07T20:33:08.8514509Z op = torch.compile(op) 2025-05-07T20:33:08.8514831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8515116Z 2025-05-07T20:33:08.8515318Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.8515487Z 2025-05-07T20:33:08.8515597Z moe/activation_test.py:117: 2025-05-07T20:33:08.8515901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8516253Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.8516538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8517259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.8517985Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.8518552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.8519272Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.8519967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.8520525Z kernel = self.compile( 2025-05-07T20:33:08.8521097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.8521864Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.8522288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8522533Z 2025-05-07T20:33:08.8522747Z self = 2025-05-07T20:33:08.8523955Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.8525600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857c73240>} 2025-05-07T20:33:08.8527016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.8528109Z context = 2025-05-07T20:33:08.8528410Z 2025-05-07T20:33:08.8528590Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.8529196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.8529682Z module_map=module_map) 2025-05-07T20:33:08.8530061Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.8530436Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.8530712Z E ^ 2025-05-07T20:33:08.8531199Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8531677Z 2025-05-07T20:33:08.8532122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8532668Z 2025-05-07T20:33:08.8532785Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8533210Z self=, 2025-05-07T20:33:08.8533637Z T=16384, 2025-05-07T20:33:08.8533841Z D=7168, 2025-05-07T20:33:08.8534040Z scale_ub=1200.0, 2025-05-07T20:33:08.8534281Z contiguous=False, 2025-05-07T20:33:08.8534599Z compiled=True, 2025-05-07T20:33:09.0966620Z ) 2025-05-07T20:33:09.0967489Z self = 2025-05-07T20:33:09.0968965Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:09.0969403Z 2025-05-07T20:33:09.0969510Z @given( 2025-05-07T20:33:09.0969807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.0970214Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.0970612Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.0970941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.0971275Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.0971555Z ) 2025-05-07T20:33:09.0971902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.0972357Z def test_silu_mul_quant( 2025-05-07T20:33:09.0972600Z self, 2025-05-07T20:33:09.0972792Z T: int, 2025-05-07T20:33:09.0972988Z D: int, 2025-05-07T20:33:09.0973200Z scale_ub: Optional[float], 2025-05-07T20:33:09.0973473Z contiguous: bool, 2025-05-07T20:33:09.0973708Z compiled: bool, 2025-05-07T20:33:09.0973929Z ) -> None: 2025-05-07T20:33:09.0974146Z torch.manual_seed(2025) 2025-05-07T20:33:09.0974492Z 2025-05-07T20:33:09.0974762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.0975112Z 2025-05-07T20:33:09.0975301Z x_sign = torch.sign(x) 2025-05-07T20:33:09.0975587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.0976044Z x = x_sign * x_clamp 2025-05-07T20:33:09.0976277Z x0 = x[:, :D] 2025-05-07T20:33:09.0976493Z x1 = x[:, D:] 2025-05-07T20:33:09.0976698Z 2025-05-07T20:33:09.0976875Z if contiguous: 2025-05-07T20:33:09.0977107Z x0 = x0.contiguous() 2025-05-07T20:33:09.0977492Z x1 = x1.contiguous() 2025-05-07T20:33:09.0977734Z 2025-05-07T20:33:09.0977924Z if scale_ub is not None: 2025-05-07T20:33:09.0978193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.0978528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.0978958Z ) 2025-05-07T20:33:09.0979149Z else: 2025-05-07T20:33:09.0979357Z scale_ub_tensor = None 2025-05-07T20:33:09.0979606Z 2025-05-07T20:33:09.0979834Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.0980151Z op = silu_mul_quant 2025-05-07T20:33:09.0980394Z if compiled: 2025-05-07T20:33:09.0980645Z op = torch.compile(op) 2025-05-07T20:33:09.0980942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.0981214Z 2025-05-07T20:33:09.0981403Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.0981569Z 2025-05-07T20:33:09.0981668Z moe/activation_test.py:117: 2025-05-07T20:33:09.0981964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.0982297Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.0982580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.0983162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.0983744Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.0984427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.0985146Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.0985698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.0986413Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.0987102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.0987658Z kernel = self.compile( 2025-05-07T20:33:09.0988214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.0988897Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.0989303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.0989539Z 2025-05-07T20:33:09.0989747Z self = 2025-05-07T20:33:09.0990860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.0992289Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857c71620>} 2025-05-07T20:33:09.0993695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.0994777Z context = 2025-05-07T20:33:09.0995072Z 2025-05-07T20:33:09.0995239Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.0995775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.0996250Z module_map=module_map) 2025-05-07T20:33:09.0996674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.0997029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.0997292Z E ^ 2025-05-07T20:33:09.0997766Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.0998344Z 2025-05-07T20:33:09.0998791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.0999383Z 2025-05-07T20:33:09.0999486Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.0999948Z self=, 2025-05-07T20:33:09.1000361Z T=1, 2025-05-07T20:33:09.1000541Z D=7168, 2025-05-07T20:33:09.1000739Z scale_ub=None, 2025-05-07T20:33:09.1000959Z contiguous=False, 2025-05-07T20:33:09.1001181Z compiled=False, 2025-05-07T20:33:09.1001384Z ) 2025-05-07T20:33:09.1001707Z self = 2025-05-07T20:33:09.1002209Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:09.1002485Z 2025-05-07T20:33:09.1002564Z @given( 2025-05-07T20:33:09.1002794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1003118Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1003428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1003764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1004101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1004391Z ) 2025-05-07T20:33:09.1004748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1005205Z def test_silu_mul_quant( 2025-05-07T20:33:09.1005449Z self, 2025-05-07T20:33:09.1005643Z T: int, 2025-05-07T20:33:09.1005842Z D: int, 2025-05-07T20:33:09.1006061Z scale_ub: Optional[float], 2025-05-07T20:33:09.1006342Z contiguous: bool, 2025-05-07T20:33:09.1006584Z compiled: bool, 2025-05-07T20:33:09.1006804Z ) -> None: 2025-05-07T20:33:09.1007025Z torch.manual_seed(2025) 2025-05-07T20:33:09.1007266Z 2025-05-07T20:33:09.1007549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1007899Z 2025-05-07T20:33:09.1008095Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1008387Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1008702Z x = x_sign * x_clamp 2025-05-07T20:33:09.1008992Z x0 = x[:, :D] 2025-05-07T20:33:09.1009214Z x1 = x[:, D:] 2025-05-07T20:33:09.1009421Z 2025-05-07T20:33:09.1009608Z if contiguous: 2025-05-07T20:33:09.1009845Z x0 = x0.contiguous() 2025-05-07T20:33:09.1010101Z x1 = x1.contiguous() 2025-05-07T20:33:09.1010348Z 2025-05-07T20:33:09.1010545Z if scale_ub is not None: 2025-05-07T20:33:09.1010821Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1011164Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1011476Z ) 2025-05-07T20:33:09.1011669Z else: 2025-05-07T20:33:09.1011877Z scale_ub_tensor = None 2025-05-07T20:33:09.1012126Z 2025-05-07T20:33:09.1012362Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1012670Z op = silu_mul_quant 2025-05-07T20:33:09.1012916Z if compiled: 2025-05-07T20:33:09.1013156Z op = torch.compile(op) 2025-05-07T20:33:09.1013452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1013726Z 2025-05-07T20:33:09.1013914Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1014077Z 2025-05-07T20:33:09.1014172Z moe/activation_test.py:117: 2025-05-07T20:33:09.1014544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1014884Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1015212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1015925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1016651Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1017281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1017997Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1018698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1019293Z kernel = self.compile( 2025-05-07T20:33:09.1019861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1020554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1020961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1021207Z 2025-05-07T20:33:09.1021416Z self = 2025-05-07T20:33:09.1022544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1023968Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857c73ba0>} 2025-05-07T20:33:09.1025374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1026790Z context = 2025-05-07T20:33:09.1027090Z 2025-05-07T20:33:09.1027260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1027790Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1028263Z module_map=module_map) 2025-05-07T20:33:09.1028631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1028999Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1029287Z E ^ 2025-05-07T20:33:09.1029757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.1030228Z 2025-05-07T20:33:09.1030667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.1031207Z 2025-05-07T20:33:09.1031309Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.1031727Z self=, 2025-05-07T20:33:09.1032139Z T=2048, 2025-05-07T20:33:09.1032319Z D=7168, 2025-05-07T20:33:09.1032511Z scale_ub=None, 2025-05-07T20:33:09.1032722Z contiguous=False, 2025-05-07T20:33:09.1032937Z compiled=True, 2025-05-07T20:33:09.1033129Z ) 2025-05-07T20:33:09.1913082Z self = 2025-05-07T20:33:09.1913883Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:09.1914280Z 2025-05-07T20:33:09.1914399Z @given( 2025-05-07T20:33:09.1914717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1915049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1915372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1915714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1916044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1916333Z ) 2025-05-07T20:33:09.1916855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1917344Z def test_silu_mul_quant( 2025-05-07T20:33:09.1917588Z self, 2025-05-07T20:33:09.1917775Z T: int, 2025-05-07T20:33:09.1917972Z D: int, 2025-05-07T20:33:09.1918195Z scale_ub: Optional[float], 2025-05-07T20:33:09.1918650Z contiguous: bool, 2025-05-07T20:33:09.1918894Z compiled: bool, 2025-05-07T20:33:09.1919119Z ) -> None: 2025-05-07T20:33:09.1919329Z torch.manual_seed(2025) 2025-05-07T20:33:09.1919574Z 2025-05-07T20:33:09.1919930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1920277Z 2025-05-07T20:33:09.1920469Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1920762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1921078Z x = x_sign * x_clamp 2025-05-07T20:33:09.1921323Z x0 = x[:, :D] 2025-05-07T20:33:09.1921537Z x1 = x[:, D:] 2025-05-07T20:33:09.1921749Z 2025-05-07T20:33:09.1921944Z if contiguous: 2025-05-07T20:33:09.1922189Z x0 = x0.contiguous() 2025-05-07T20:33:09.1922461Z x1 = x1.contiguous() 2025-05-07T20:33:09.1922706Z 2025-05-07T20:33:09.1922904Z if scale_ub is not None: 2025-05-07T20:33:09.1923194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1923540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1923862Z ) 2025-05-07T20:33:09.1924067Z else: 2025-05-07T20:33:09.1924280Z scale_ub_tensor = None 2025-05-07T20:33:09.1924551Z 2025-05-07T20:33:09.1924791Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1925104Z op = silu_mul_quant 2025-05-07T20:33:09.1925359Z if compiled: 2025-05-07T20:33:09.1925943Z op = torch.compile(op) 2025-05-07T20:33:09.1926243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1926527Z 2025-05-07T20:33:09.1926720Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1926887Z 2025-05-07T20:33:09.1926989Z moe/activation_test.py:117: 2025-05-07T20:33:09.1927284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1927627Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1927921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1928502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.1929095Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.1929785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1930508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1931060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1931773Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1932472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1933023Z kernel = self.compile( 2025-05-07T20:33:09.1933590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1934282Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1934846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1935087Z 2025-05-07T20:33:09.1935295Z self = 2025-05-07T20:33:09.1936417Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1937947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99436e2ca0>} 2025-05-07T20:33:09.1939507Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1940607Z context = 2025-05-07T20:33:09.1940910Z 2025-05-07T20:33:09.1941081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1941688Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1942180Z module_map=module_map) 2025-05-07T20:33:09.1942554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1942918Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1943193Z E ^ 2025-05-07T20:33:09.1943666Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.1944143Z 2025-05-07T20:33:09.1944588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.1945142Z 2025-05-07T20:33:09.1945250Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.1945680Z self=, 2025-05-07T20:33:09.1946099Z T=4096, 2025-05-07T20:33:09.1946300Z D=7168, 2025-05-07T20:33:09.1946500Z scale_ub=None, 2025-05-07T20:33:09.1946718Z contiguous=False, 2025-05-07T20:33:09.1946952Z compiled=True, 2025-05-07T20:33:09.1947160Z ) 2025-05-07T20:33:09.1947491Z self = 2025-05-07T20:33:09.1948006Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:09.1948305Z 2025-05-07T20:33:09.1948385Z @given( 2025-05-07T20:33:09.1948618Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1948936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1949256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1949604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1949942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1950242Z ) 2025-05-07T20:33:09.1950605Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1951072Z def test_silu_mul_quant( 2025-05-07T20:33:09.1951315Z self, 2025-05-07T20:33:09.1951517Z T: int, 2025-05-07T20:33:09.1951722Z D: int, 2025-05-07T20:33:09.1951941Z scale_ub: Optional[float], 2025-05-07T20:33:09.1952220Z contiguous: bool, 2025-05-07T20:33:09.1952470Z compiled: bool, 2025-05-07T20:33:09.1952696Z ) -> None: 2025-05-07T20:33:09.1952916Z torch.manual_seed(2025) 2025-05-07T20:33:09.1953161Z 2025-05-07T20:33:09.1953437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1953799Z 2025-05-07T20:33:09.1954005Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1954303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1954628Z x = x_sign * x_clamp 2025-05-07T20:33:09.1954876Z x0 = x[:, :D] 2025-05-07T20:33:09.1955091Z x1 = x[:, D:] 2025-05-07T20:33:09.1955306Z 2025-05-07T20:33:09.1955499Z if contiguous: 2025-05-07T20:33:09.1955731Z x0 = x0.contiguous() 2025-05-07T20:33:09.1956004Z x1 = x1.contiguous() 2025-05-07T20:33:09.1956260Z 2025-05-07T20:33:09.1956464Z if scale_ub is not None: 2025-05-07T20:33:09.1956739Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1957086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1957458Z ) 2025-05-07T20:33:09.1957653Z else: 2025-05-07T20:33:09.1957864Z scale_ub_tensor = None 2025-05-07T20:33:09.1958123Z 2025-05-07T20:33:09.1958352Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1958674Z op = silu_mul_quant 2025-05-07T20:33:09.1959034Z if compiled: 2025-05-07T20:33:09.1959280Z op = torch.compile(op) 2025-05-07T20:33:09.1959587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1959865Z 2025-05-07T20:33:09.1960051Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1960263Z 2025-05-07T20:33:09.1960360Z moe/activation_test.py:117: 2025-05-07T20:33:09.1960660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1961003Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1961286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1961869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.1962464Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.1963148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1963882Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1964445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1965161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1965854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1966414Z kernel = self.compile( 2025-05-07T20:33:09.1966982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1967667Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1968082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1968327Z 2025-05-07T20:33:09.1968536Z self = 2025-05-07T20:33:09.1969661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1971084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9942a4f240>} 2025-05-07T20:33:09.1972681Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1973769Z context = 2025-05-07T20:33:09.1974073Z 2025-05-07T20:33:09.1974242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1974937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1975422Z module_map=module_map) 2025-05-07T20:33:09.1975794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1976155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1976434Z E ^ 2025-05-07T20:33:09.1977002Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.1977483Z 2025-05-07T20:33:09.1977922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.1978481Z 2025-05-07T20:33:09.3588843Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.3589680Z self=, 2025-05-07T20:33:09.3590391Z T=16384, 2025-05-07T20:33:09.3590721Z D=5120, 2025-05-07T20:33:09.3591206Z scale_ub=1200.0, 2025-05-07T20:33:09.3599176Z contiguous=False, 2025-05-07T20:33:09.3599418Z compiled=False, 2025-05-07T20:33:09.3599790Z ) 2025-05-07T20:33:09.3600159Z self = 2025-05-07T20:33:09.3600750Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:09.3601143Z 2025-05-07T20:33:09.3601225Z @given( 2025-05-07T20:33:09.3601479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.3601835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.3602176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.3602556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.3602932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.3603253Z ) 2025-05-07T20:33:09.3603660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.3604185Z def test_silu_mul_quant( 2025-05-07T20:33:09.3604454Z self, 2025-05-07T20:33:09.3604661Z T: int, 2025-05-07T20:33:09.3604883Z D: int, 2025-05-07T20:33:09.3605124Z scale_ub: Optional[float], 2025-05-07T20:33:09.3605423Z contiguous: bool, 2025-05-07T20:33:09.3605690Z compiled: bool, 2025-05-07T20:33:09.3605932Z ) -> None: 2025-05-07T20:33:09.3606147Z torch.manual_seed(2025) 2025-05-07T20:33:09.3606398Z 2025-05-07T20:33:09.3606680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.3607032Z 2025-05-07T20:33:09.3607231Z x_sign = torch.sign(x) 2025-05-07T20:33:09.3607530Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.3607842Z x = x_sign * x_clamp 2025-05-07T20:33:09.3608091Z x0 = x[:, :D] 2025-05-07T20:33:09.3608310Z x1 = x[:, D:] 2025-05-07T20:33:09.3608513Z 2025-05-07T20:33:09.3608699Z if contiguous: 2025-05-07T20:33:09.3608937Z x0 = x0.contiguous() 2025-05-07T20:33:09.3609205Z x1 = x1.contiguous() 2025-05-07T20:33:09.3609444Z 2025-05-07T20:33:09.3609646Z if scale_ub is not None: 2025-05-07T20:33:09.3609921Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.3610258Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.3610578Z ) 2025-05-07T20:33:09.3610767Z else: 2025-05-07T20:33:09.3610976Z scale_ub_tensor = None 2025-05-07T20:33:09.3611233Z 2025-05-07T20:33:09.3611472Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.3611788Z op = silu_mul_quant 2025-05-07T20:33:09.3612046Z if compiled: 2025-05-07T20:33:09.3612302Z op = torch.compile(op) 2025-05-07T20:33:09.3612602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.3612888Z 2025-05-07T20:33:09.3613085Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.3613251Z 2025-05-07T20:33:09.3613351Z moe/activation_test.py:117: 2025-05-07T20:33:09.3613656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.3614003Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.3614297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.3615189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:09.3615925Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:09.3616490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:09.3617206Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:09.3617908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:09.3618538Z     kernel = self.compile(
2025-05-07T20:33:09.3619112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:09.3619882Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:09.3620308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3620559Z 
2025-05-07T20:33:09.3620776Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:09.3621953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:09.3623384Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f9942a4e840>}
2025-05-07T20:33:09.3624802Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:09.3626196Z context = <...>
2025-05-07T20:33:09.3626498Z 
2025-05-07T20:33:09.3626678Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:09.3627222Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:09.3627706Z                            module_map=module_map)
2025-05-07T20:33:09.3628102Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:09.3628468Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:09.3628735Z E       ^
2025-05-07T20:33:09.3629209Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.3629690Z 
2025-05-07T20:33:09.3630126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:09.3630674Z 
2025-05-07T20:33:09.3630777Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.3631201Z     self=<...>,
2025-05-07T20:33:09.3631614Z     T=16384,
2025-05-07T20:33:09.3631813Z     D=5120,
2025-05-07T20:33:09.3632008Z     scale_ub=1200.0,
2025-05-07T20:33:09.3632231Z     contiguous=True,
2025-05-07T20:33:09.3632459Z     compiled=True,
2025-05-07T20:33:09.3632664Z )
2025-05-07T20:33:09.3632986Z self = <...>
2025-05-07T20:33:09.3633498Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:09.3633783Z 
2025-05-07T20:33:09.3633869Z     @given(
2025-05-07T20:33:09.3634092Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:09.3634426Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:09.3634744Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:09.3635082Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:09.3635414Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:09.3635706Z     )
2025-05-07T20:33:09.3636064Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:09.3636514Z     def test_silu_mul_quant(
2025-05-07T20:33:09.3636766Z         self,
2025-05-07T20:33:09.3636966Z         T: int,
2025-05-07T20:33:09.3637160Z         D: int,
2025-05-07T20:33:09.3637384Z         scale_ub: Optional[float],
2025-05-07T20:33:09.3637660Z         contiguous: bool,
2025-05-07T20:33:09.3637898Z         compiled: bool,
2025-05-07T20:33:09.3638127Z     ) -> None:
2025-05-07T20:33:09.3638349Z         torch.manual_seed(2025)
2025-05-07T20:33:09.3638590Z 
2025-05-07T20:33:09.3638963Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:09.3639331Z 
2025-05-07T20:33:09.3639537Z         x_sign = torch.sign(x)
2025-05-07T20:33:09.3639840Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.3640172Z         x = x_sign * x_clamp
2025-05-07T20:33:09.3640544Z         x0 = x[:, :D]
2025-05-07T20:33:09.3640769Z         x1 = x[:, D:]
2025-05-07T20:33:09.3640988Z 
2025-05-07T20:33:09.3641188Z         if contiguous:
2025-05-07T20:33:09.3641422Z             x0 = x0.contiguous()
2025-05-07T20:33:09.3641780Z             x1 = x1.contiguous()
2025-05-07T20:33:09.3642032Z 
2025-05-07T20:33:09.3642226Z         if scale_ub is not None:
2025-05-07T20:33:09.3642512Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:09.3642857Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:09.3643169Z             )
2025-05-07T20:33:09.3643372Z         else:
2025-05-07T20:33:09.3643589Z             scale_ub_tensor = None
2025-05-07T20:33:09.3643852Z 
2025-05-07T20:33:09.3644092Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:09.3644424Z             op = silu_mul_quant
2025-05-07T20:33:09.3644676Z             if compiled:
2025-05-07T20:33:09.3644931Z                 op = torch.compile(op)
2025-05-07T20:33:09.3645245Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:09.3645534Z 
2025-05-07T20:33:09.3645729Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:09.3645905Z 
2025-05-07T20:33:09.3646009Z moe/activation_test.py:117: 
2025-05-07T20:33:09.3646315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3646657Z moe/activation_test.py:115: in fn
2025-05-07T20:33:09.3646954Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:09.3647545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:09.3648135Z     return fn(*args, **kwargs)
2025-05-07T20:33:09.3648837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:09.3649575Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:09.3650157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:09.3650879Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:09.3651589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:09.3653652Z     kernel = self.compile(
2025-05-07T20:33:09.3654230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:09.3655035Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:09.3655461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3655705Z 
2025-05-07T20:33:09.3655929Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:09.3657068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:09.3658502Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f9943fc6ca0>}
2025-05-07T20:33:09.3659924Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:09.3661016Z context = <...>
2025-05-07T20:33:09.3661317Z 
2025-05-07T20:33:09.3661497Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:09.3662093Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:09.3662585Z                            module_map=module_map)
2025-05-07T20:33:09.3663046Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:09.3663418Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:09.3663683Z E       ^
2025-05-07T20:33:09.3664170Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.3664685Z 
2025-05-07T20:33:09.3665130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:09.3665672Z 
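Every sampled example aborts at the same point: Triton refuses to lower the _fbgemm_silu_mul_quant kernel because fp8e4nv (the FP8 E4M3 format behind torch.float8_e4m3fn) is only implemented for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); older parts expose only fp8e4b15 and fp8e5, exactly as the ValueError says. The failure is raised at kernel compile time, before any tensor data is touched. A minimal sketch of a capability guard that would skip these cases on unsupported hardware follows; the helper name and decorator placement are illustrative assumptions, not FBGEMM's actual test code.

# Sketch only: skip FP8 E4M3 tests on GPUs older than SM 8.9.
# supports_fp8_e4m3 is a hypothetical helper, not part of FBGEMM.
import unittest

import torch


def supports_fp8_e4m3() -> bool:
    # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on compute
    # capability >= 8.9, e.g. L4 (sm_89) or H100 (sm_90).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the failing test, something like:
# @unittest.skipUnless(supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None:
#     ...

Hypothesis goes on to retry further sampled parameter combinations, every one of which fails with this same CompilationError: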
2025-05-07T20:33:09.5382476Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.5383830Z     self=<...>,
2025-05-07T20:33:09.5384755Z     T=16384,
2025-05-07T20:33:09.5385154Z     D=5120,
2025-05-07T20:33:09.5385521Z     scale_ub=None,
2025-05-07T20:33:09.5385947Z     contiguous=False,
2025-05-07T20:33:09.5386400Z     compiled=True,
2025-05-07T20:33:09.5386790Z )
2025-05-07T20:33:09.5418838Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.5419253Z     self=<...>,
2025-05-07T20:33:09.5419668Z     T=2048,
2025-05-07T20:33:09.5419867Z     D=5120,
2025-05-07T20:33:09.5420066Z     scale_ub=None,
2025-05-07T20:33:09.5420283Z     contiguous=False,
2025-05-07T20:33:09.5420514Z     compiled=True,
2025-05-07T20:33:09.5420719Z )
2025-05-07T20:33:09.6377158Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.6377589Z     self=<...>,
2025-05-07T20:33:09.6378016Z     T=2048,
2025-05-07T20:33:09.6378215Z     D=5120,
2025-05-07T20:33:09.6378409Z     scale_ub=1200.0,
2025-05-07T20:33:09.6378629Z     contiguous=False,
2025-05-07T20:33:09.6378858Z     compiled=True,
2025-05-07T20:33:09.6379064Z )
2025-05-07T20:33:09.8174323Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.8174800Z     self=<...>,
2025-05-07T20:33:09.8175331Z     T=4096,
2025-05-07T20:33:09.8175525Z     D=5120,
2025-05-07T20:33:09.8175714Z     scale_ub=1200.0,
2025-05-07T20:33:09.8175931Z     contiguous=True,
2025-05-07T20:33:09.8176148Z     compiled=True,
2025-05-07T20:33:09.8176356Z )
2025-05-07T20:33:09.8207072Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.8207503Z     self=<...>,
2025-05-07T20:33:09.8207918Z     T=128,
2025-05-07T20:33:09.8208112Z     D=5120,
2025-05-07T20:33:09.8208302Z     scale_ub=1200.0,
2025-05-07T20:33:09.8208519Z     contiguous=False,
2025-05-07T20:33:09.8208747Z     compiled=True,
2025-05-07T20:33:09.8208951Z )
2025-05-07T20:33:10.1391538Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.1391966Z     self=<...>,
2025-05-07T20:33:10.1392387Z     T=16384,
2025-05-07T20:33:10.1392571Z     D=7168,
2025-05-07T20:33:10.1392764Z     scale_ub=1200.0,
2025-05-07T20:33:10.1392980Z     contiguous=True,
2025-05-07T20:33:10.1393196Z     compiled=True,
2025-05-07T20:33:10.1393398Z )
2025-05-07T20:33:10.2660339Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.2660803Z     self=<...>,
2025-05-07T20:33:10.2661237Z     T=16384,
2025-05-07T20:33:10.2661526Z     D=5120,
2025-05-07T20:33:10.2661791Z     scale_ub=1200.0,
2025-05-07T20:33:10.2662259Z     contiguous=True,
2025-05-07T20:33:10.2662554Z     compiled=False,
2025-05-07T20:33:10.2662819Z )
2025-05-07T20:33:10.2699250Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.2699680Z     self=<...>,
2025-05-07T20:33:10.2700096Z     T=1,
2025-05-07T20:33:10.2700290Z     D=7168,
2025-05-07T20:33:10.2700483Z     scale_ub=1200.0,
2025-05-07T20:33:10.2700702Z     contiguous=False,
2025-05-07T20:33:10.2700927Z     compiled=False,
2025-05-07T20:33:10.2701129Z )
2025-05-07T20:33:10.4478664Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.4479097Z     self=<...>,
2025-05-07T20:33:10.4479945Z     T=4096,
2025-05-07T20:33:10.4480462Z     D=7168,
2025-05-07T20:33:10.4480882Z     scale_ub=1200.0,
2025-05-07T20:33:10.4481342Z     contiguous=False,
2025-05-07T20:33:10.4481827Z     compiled=True,
2025-05-07T20:33:10.4482221Z )
2025-05-07T20:33:10.4516421Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.4516851Z     self=<...>,
2025-05-07T20:33:10.4517260Z     T=128,
2025-05-07T20:33:10.4517446Z     D=7168,
2025-05-07T20:33:10.4517639Z     scale_ub=1200.0,
2025-05-07T20:33:10.4517862Z     contiguous=False,
2025-05-07T20:33:10.4518097Z     compiled=True,
2025-05-07T20:33:10.4518307Z )
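Because the error is raised while compiling the kernel, the sampled T, D, scale_ub, contiguous, and compiled values never influence the outcome. For debugging on hardware without FP8 E4M3, a plain-PyTorch stand-in for what the test appears to exercise can be useful; the sketch below assumes silu_mul_quant computes silu(x0) * x1 and quantizes the result with a single tensorwise scale capped by scale_ub, which is an inference from the test body, not FBGEMM's documented contract.

# Rough reference only; assumes tensorwise FP8 E4M3 quantization.
from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1, computed in float32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # One scale for the whole tensor, optionally capped by scale_ub.
    amax = y.abs().amax()
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.to(amax.dtype).squeeze())
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = (amax / fp8_max).clamp(min=1e-12)
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale

The last example the shrinker tried before this point in the log: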
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.5467092Z 
2025-05-07T20:33:10.5467535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:10.5468084Z 
2025-05-07T20:33:10.5468192Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fails identically: silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) launches _fbgemm_silu_mul_quant and Triton raises triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.6100871Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:10.6115104Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 112.00 MiB with 28.44 MiB free (21.61 GiB allocated by PyTorch)
2025-05-07T20:33:10.6136509Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 448.00 MiB with 140.44 MiB free (21.50 GiB allocated by PyTorch)
2025-05-07T20:33:10.6149175Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 56.00 MiB with 28.44 MiB free (21.67 GiB allocated by PyTorch)
2025-05-07T20:33:10.6162514Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 28.44 MiB free (21.67 GiB allocated by PyTorch)
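Alongside the compile failures, the run now also hits torch.OutOfMemoryError, and the "allocated by PyTorch" figure climbs steadily across examples (21.50 -> 21.60 -> 21.67 GiB and onward): memory from earlier failed examples is evidently not released before Hypothesis tries the next one, until even a 40-56 MiB request fails on the 22.07 GiB card. The error text's own suggestion can be applied from the job environment; a sketch, with the caveat that the variable must be set before the first CUDA allocation and that it mitigates fragmentation rather than genuine exhaustion:

    # e.g. in conftest.py or the CI job environment, before CUDA is
    # initialized; a mitigation sketch, not a change the suite makes.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")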
2025-05-07T20:33:10.7315684Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
2025-05-07T20:33:10.7347600Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
2025-05-07T20:33:10.8058765Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
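Whether an example dies with CompilationError or OutOfMemoryError depends only on whether it manages to allocate its inputs first. A more direct remedy for the accumulation than the allocator flag would be to release CUDA memory between examples; a hedged sketch of a per-test teardown (the helper name is invented; gc.collect and torch.cuda.empty_cache are real APIs):

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Collect unreachable Python objects first so their CUDA storages
        # are freed, then return the allocator's cached blocks to the driver.
        gc.collect()
        torch.cuda.empty_cache()

    # e.g. in the TestCase:
    # def tearDown(self) -> None:
    #     release_cuda_memory()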
2025-05-07T20:33:10.8090604Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 56.00 MiB with 26.44 MiB free (21.69 GiB allocated by PyTorch)
2025-05-07T20:33:10.8918982Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
2025-05-07T20:33:10.8951221Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:33:10.8964511Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 320.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
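To debug any one of these outside the property-based loop, the failing example can be replayed directly from its printed parameters; a sketch assuming silu_mul_quant is importable from the fbgemm_gpu.experimental.gen_ai.moe.activation module named in the traceback:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Parameters copied from one failing example above:
    # T=128, D=5120, scale_ub=None, contiguous=True, compiled=False.
    torch.manual_seed(2025)
    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    # On this runner the Triton kernel fails to compile (fp8e4nv unsupported):
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)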
2025-05-07T20:33:10.9737996Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 80.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:33:10.9750718Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:33:10.9763386Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:33:10.9776258Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
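By this point every example dies at the initial torch.randn, and the requested sizes match a [T, 2*D] bfloat16 tensor (2 bytes per element) exactly; a quick arithmetic check:

    # Size in MiB of torch.randn([T, 2 * D], dtype=torch.bfloat16).
    def randn_alloc_mib(T: int, D: int) -> float:
        return T * 2 * D * 2 / 1024**2

    assert randn_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert randn_alloc_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert randn_alloc_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"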
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.9788611Z 2025-05-07T20:33:10.9788733Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.9788983Z 2025-05-07T20:33:10.9789112Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.9789536Z self=, 2025-05-07T20:33:10.9789957Z T=4096, 2025-05-07T20:33:10.9790144Z D=7168, 2025-05-07T20:33:10.9790333Z scale_ub=1200.0, 2025-05-07T20:33:10.9790546Z contiguous=True, 2025-05-07T20:33:10.9790758Z compiled=False, 2025-05-07T20:33:10.9790955Z ) 2025-05-07T20:33:11.0858249Z self = 2025-05-07T20:33:11.0858938Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.0859406Z 2025-05-07T20:33:11.0859527Z @given( 2025-05-07T20:33:11.0859826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0860261Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0860666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0860997Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0861319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0861605Z ) 2025-05-07T20:33:11.0861952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0862399Z def test_silu_mul_quant( 2025-05-07T20:33:11.0862635Z self, 2025-05-07T20:33:11.0862824Z T: int, 2025-05-07T20:33:11.0863006Z D: int, 2025-05-07T20:33:11.0863214Z scale_ub: Optional[float], 2025-05-07T20:33:11.0863485Z contiguous: bool, 2025-05-07T20:33:11.0863717Z compiled: bool, 2025-05-07T20:33:11.0863939Z ) -> None: 2025-05-07T20:33:11.0864143Z torch.manual_seed(2025) 2025-05-07T20:33:11.0864373Z 2025-05-07T20:33:11.0864640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0866822Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0868826Z 2025-05-07T20:33:11.0868940Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0869160Z 2025-05-07T20:33:11.0869267Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0869683Z self=, 2025-05-07T20:33:11.0870092Z T=16384, 2025-05-07T20:33:11.0870284Z D=7168, 2025-05-07T20:33:11.0870469Z scale_ub=None, 2025-05-07T20:33:11.0870683Z contiguous=False, 2025-05-07T20:33:11.0870901Z compiled=True, 2025-05-07T20:33:11.0871091Z ) 2025-05-07T20:33:11.0871403Z self = 2025-05-07T20:33:11.0871906Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.0872188Z 2025-05-07T20:33:11.0872378Z @given( 2025-05-07T20:33:11.0872598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0872909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0873216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0873657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0873991Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0874276Z ) 2025-05-07T20:33:11.0874615Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0875125Z def test_silu_mul_quant( 2025-05-07T20:33:11.0875364Z self, 2025-05-07T20:33:11.0875555Z T: int, 2025-05-07T20:33:11.0875739Z D: int, 2025-05-07T20:33:11.0875949Z scale_ub: Optional[float], 2025-05-07T20:33:11.0876216Z contiguous: bool, 2025-05-07T20:33:11.0876444Z compiled: bool, 2025-05-07T20:33:11.0876654Z ) -> None: 2025-05-07T20:33:11.0876870Z torch.manual_seed(2025) 2025-05-07T20:33:11.0877104Z 2025-05-07T20:33:11.0877374Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0879558Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0881551Z 2025-05-07T20:33:11.0881670Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0881882Z 2025-05-07T20:33:11.0881984Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0882393Z self=, 2025-05-07T20:33:11.0882808Z T=4096, 2025-05-07T20:33:11.0882987Z D=7168, 2025-05-07T20:33:11.0883165Z scale_ub=None, 2025-05-07T20:33:11.0883373Z contiguous=True, 2025-05-07T20:33:11.0883586Z compiled=False, 2025-05-07T20:33:11.0883779Z ) 2025-05-07T20:33:11.0884098Z self = 2025-05-07T20:33:11.0884601Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:11.0884881Z 2025-05-07T20:33:11.0884960Z @given( 2025-05-07T20:33:11.0885185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0885498Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0893065Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0893531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0893868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0894167Z ) 2025-05-07T20:33:11.0894625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0895083Z def test_silu_mul_quant( 2025-05-07T20:33:11.0895329Z self, 2025-05-07T20:33:11.0895519Z T: int, 2025-05-07T20:33:11.0895710Z D: int, 2025-05-07T20:33:11.0895935Z scale_ub: Optional[float], 2025-05-07T20:33:11.0896204Z contiguous: bool, 2025-05-07T20:33:11.0896440Z compiled: bool, 2025-05-07T20:33:11.0896661Z ) -> None: 2025-05-07T20:33:11.0896881Z torch.manual_seed(2025) 2025-05-07T20:33:11.0897136Z 2025-05-07T20:33:11.0897418Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0899670Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
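Every OOM message in this run suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it in-process, with the caveat that the caching allocator reads the variable at initialization, so it must be set before the process makes its first CUDA allocation:

    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # the first CUDA use after this point picks up the setting

In CI it is usually simpler to export the variable in the job environment (e.g. PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True pytest moe/activation_test.py), which sidesteps import-order concerns entirely.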
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0901758Z 2025-05-07T20:33:11.0901963Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0902185Z 2025-05-07T20:33:11.0902298Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0902726Z self=, 2025-05-07T20:33:11.0903190Z T=16384, 2025-05-07T20:33:11.0903386Z D=7168, 2025-05-07T20:33:11.0903576Z scale_ub=None, 2025-05-07T20:33:11.0903792Z contiguous=True, 2025-05-07T20:33:11.0904016Z compiled=False, 2025-05-07T20:33:11.0904221Z ) 2025-05-07T20:33:11.0904546Z self = 2025-05-07T20:33:11.0905056Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:11.0905347Z 2025-05-07T20:33:11.0905425Z @given( 2025-05-07T20:33:11.0905662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0905976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0906286Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0906619Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0906947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0907230Z ) 2025-05-07T20:33:11.0907579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0908036Z def test_silu_mul_quant( 2025-05-07T20:33:11.0908271Z self, 2025-05-07T20:33:11.0908466Z T: int, 2025-05-07T20:33:11.0908660Z D: int, 2025-05-07T20:33:11.0908874Z scale_ub: Optional[float], 2025-05-07T20:33:11.0909142Z contiguous: bool, 2025-05-07T20:33:11.0909382Z compiled: bool, 2025-05-07T20:33:11.0909602Z ) -> None: 2025-05-07T20:33:11.0909812Z torch.manual_seed(2025) 2025-05-07T20:33:11.0910057Z 2025-05-07T20:33:11.0910329Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0912507Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0914507Z 2025-05-07T20:33:11.0914625Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0914845Z 2025-05-07T20:33:11.0914945Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0915364Z self=, 2025-05-07T20:33:11.0915782Z T=16384, 2025-05-07T20:33:11.0915974Z D=7168, 2025-05-07T20:33:11.0916175Z scale_ub=1200.0, 2025-05-07T20:33:11.0916398Z contiguous=True, 2025-05-07T20:33:11.0916627Z compiled=False, 2025-05-07T20:33:11.0916838Z ) 2025-05-07T20:33:11.0917167Z self = 2025-05-07T20:33:11.0917679Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.0917973Z 2025-05-07T20:33:11.0918055Z @given( 2025-05-07T20:33:11.0918287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0918601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0918916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0919261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0919605Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0919954Z ) 2025-05-07T20:33:11.0920318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0920776Z def test_silu_mul_quant( 2025-05-07T20:33:11.0921017Z self, 2025-05-07T20:33:11.0921230Z T: int, 2025-05-07T20:33:11.0921516Z D: int, 2025-05-07T20:33:11.0921742Z scale_ub: Optional[float], 2025-05-07T20:33:11.0922026Z contiguous: bool, 2025-05-07T20:33:11.0922274Z compiled: bool, 2025-05-07T20:33:11.0922499Z ) -> None: 2025-05-07T20:33:11.0922750Z torch.manual_seed(2025) 2025-05-07T20:33:11.0922991Z 2025-05-07T20:33:11.0923262Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0925707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0927713Z 2025-05-07T20:33:11.0927830Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0928049Z 2025-05-07T20:33:11.0928156Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0928580Z self=, 2025-05-07T20:33:11.0929046Z T=128, 2025-05-07T20:33:11.0929232Z D=5120, 2025-05-07T20:33:11.0929427Z scale_ub=1200.0, 2025-05-07T20:33:11.0929653Z contiguous=False, 2025-05-07T20:33:11.0929877Z compiled=False, 2025-05-07T20:33:11.0930087Z ) 2025-05-07T20:33:11.2213188Z self = 2025-05-07T20:33:11.2213954Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:11.2214460Z 2025-05-07T20:33:11.2214575Z @given( 2025-05-07T20:33:11.2214881Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2215258Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2215567Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2215893Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2216218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2216509Z ) 2025-05-07T20:33:11.2216850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2217317Z def test_silu_mul_quant( 2025-05-07T20:33:11.2217558Z self, 2025-05-07T20:33:11.2217745Z T: int, 2025-05-07T20:33:11.2217938Z D: int, 2025-05-07T20:33:11.2218143Z scale_ub: Optional[float], 2025-05-07T20:33:11.2218415Z contiguous: bool, 2025-05-07T20:33:11.2218643Z compiled: bool, 2025-05-07T20:33:11.2218859Z ) -> None: 2025-05-07T20:33:11.2219068Z torch.manual_seed(2025) 2025-05-07T20:33:11.2219306Z 2025-05-07T20:33:11.2219570Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2219926Z 2025-05-07T20:33:11.2220119Z x_sign = torch.sign(x) 2025-05-07T20:33:11.2220407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.2220719Z x = x_sign * x_clamp 2025-05-07T20:33:11.2220961Z x0 = x[:, :D] 2025-05-07T20:33:11.2221174Z x1 = x[:, D:] 2025-05-07T20:33:11.2221375Z 2025-05-07T20:33:11.2221556Z if contiguous: 2025-05-07T20:33:11.2221784Z x0 = x0.contiguous() 2025-05-07T20:33:11.2222041Z x1 = x1.contiguous() 2025-05-07T20:33:11.2222283Z 2025-05-07T20:33:11.2222465Z if scale_ub is not None: 2025-05-07T20:33:11.2222734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.2223212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.2223522Z ) 2025-05-07T20:33:11.2223703Z else: 2025-05-07T20:33:11.2223912Z scale_ub_tensor = None 2025-05-07T20:33:11.2224165Z 2025-05-07T20:33:11.2224511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.2224838Z op = silu_mul_quant 2025-05-07T20:33:11.2225101Z if compiled: 2025-05-07T20:33:11.2225347Z op = torch.compile(op) 2025-05-07T20:33:11.2225828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.2226183Z 2025-05-07T20:33:11.2226374Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.2226547Z 2025-05-07T20:33:11.2226646Z moe/activation_test.py:117: 2025-05-07T20:33:11.2226944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.2227289Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.2227577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.2228308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.2229036Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.2229598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.2230315Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.2231006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.2231573Z kernel = self.compile( 2025-05-07T20:33:11.2232128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.2232809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.2233219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.2233458Z 2025-05-07T20:33:11.2233668Z self = 2025-05-07T20:33:11.2234799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.2236226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98569cccc0>} 2025-05-07T20:33:11.2237625Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.2238701Z context = 2025-05-07T20:33:11.2239000Z 2025-05-07T20:33:11.2239168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.2239711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.2240195Z module_map=module_map) 2025-05-07T20:33:11.2240576Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.2240932Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.2241203Z E ^ 2025-05-07T20:33:11.2241686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.2242159Z 2025-05-07T20:33:11.2242594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.2243140Z 2025-05-07T20:33:11.2243243Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2243668Z self=, 2025-05-07T20:33:11.2244166Z T=2048, 2025-05-07T20:33:11.2244346Z D=7168, 2025-05-07T20:33:11.2244538Z scale_ub=None, 2025-05-07T20:33:11.2244758Z contiguous=False, 2025-05-07T20:33:11.2244985Z compiled=False, 2025-05-07T20:33:11.2245180Z ) 2025-05-07T20:33:11.2245619Z self = 2025-05-07T20:33:11.2246130Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.2246415Z 2025-05-07T20:33:11.2246490Z @given( 2025-05-07T20:33:11.2246719Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2247073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2247379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2247706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2248041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2248339Z ) 2025-05-07T20:33:11.2248686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2249148Z def test_silu_mul_quant( 2025-05-07T20:33:11.2249397Z self, 2025-05-07T20:33:11.2249589Z T: int, 2025-05-07T20:33:11.2249790Z D: int, 2025-05-07T20:33:11.2250012Z scale_ub: Optional[float], 2025-05-07T20:33:11.2250293Z contiguous: bool, 2025-05-07T20:33:11.2250535Z compiled: bool, 2025-05-07T20:33:11.2250756Z ) -> None: 2025-05-07T20:33:11.2250961Z torch.manual_seed(2025) 2025-05-07T20:33:11.2251199Z 2025-05-07T20:33:11.2251469Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2253659Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
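The CompilationError above is a different failure mode from the OOMs: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this GPU, offering only 'fp8e4b15' and 'fp8e5'. That is the expected behavior on Ampere-class parts such as the A10G (sm_86), since Triton's fp8e4nv needs hardware FP8 (compute capability 8.9 and newer). A hedged sketch of a capability guard a test could use; the 8.9 threshold is an assumption about this particular Triton build:

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv needs hardware FP8 (Ada/Hopper, sm_89+); Ampere is sm_80/sm_86.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # e.g. @unittest.skipIf(not fp8e4nv_supported(), "FP8 E4M3 needs sm_89+")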
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.2255691Z 2025-05-07T20:33:11.2255817Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.2256030Z 2025-05-07T20:33:11.2256128Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2256541Z self=, 2025-05-07T20:33:11.2256953Z T=128, 2025-05-07T20:33:11.2257132Z D=7168, 2025-05-07T20:33:11.2257319Z scale_ub=1200.0, 2025-05-07T20:33:11.2257533Z contiguous=True, 2025-05-07T20:33:11.2257744Z compiled=True, 2025-05-07T20:33:11.2257938Z ) 2025-05-07T20:33:11.2572579Z self = 2025-05-07T20:33:11.2573172Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.2573942Z 2025-05-07T20:33:11.2574241Z @given( 2025-05-07T20:33:11.2574688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2575130Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2575542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2575982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2576309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2576597Z ) 2025-05-07T20:33:11.2576939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2577395Z def test_silu_mul_quant( 2025-05-07T20:33:11.2577625Z self, 2025-05-07T20:33:11.2577814Z T: int, 2025-05-07T20:33:11.2578005Z D: int, 2025-05-07T20:33:11.2578214Z scale_ub: Optional[float], 2025-05-07T20:33:11.2578483Z contiguous: bool, 2025-05-07T20:33:11.2578719Z compiled: bool, 2025-05-07T20:33:11.2578930Z ) -> None: 2025-05-07T20:33:11.2579262Z torch.manual_seed(2025) 2025-05-07T20:33:11.2579506Z 2025-05-07T20:33:11.2579774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2580118Z 2025-05-07T20:33:11.2580305Z x_sign = torch.sign(x) 2025-05-07T20:33:11.2580702Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.2581011Z x = x_sign * x_clamp 2025-05-07T20:33:11.2581244Z x0 = x[:, :D] 2025-05-07T20:33:11.2581448Z x1 = x[:, D:] 2025-05-07T20:33:11.2581648Z 2025-05-07T20:33:11.2581824Z if contiguous: 2025-05-07T20:33:11.2582108Z x0 = x0.contiguous() 2025-05-07T20:33:11.2582360Z x1 = x1.contiguous() 2025-05-07T20:33:11.2582602Z 2025-05-07T20:33:11.2582784Z if scale_ub is not None: 2025-05-07T20:33:11.2583054Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.2583384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.2583693Z ) 2025-05-07T20:33:11.2583873Z else: 2025-05-07T20:33:11.2584076Z scale_ub_tensor = None 2025-05-07T20:33:11.2584321Z 2025-05-07T20:33:11.2584541Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.2584853Z op = silu_mul_quant 2025-05-07T20:33:11.2585105Z if compiled: 2025-05-07T20:33:11.2585340Z op = torch.compile(op) 2025-05-07T20:33:11.2585634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.2585910Z 2025-05-07T20:33:11.2586093Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.2586262Z 2025-05-07T20:33:11.2586358Z moe/activation_test.py:117: 2025-05-07T20:33:11.2586652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.2586979Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.2587258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.2587831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.2588413Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.2589085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.2589798Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.2590353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.2591057Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.2591743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.2592293Z kernel = self.compile( 2025-05-07T20:33:11.2592846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.2593519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.2593924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.2594158Z 2025-05-07T20:33:11.2594364Z self = 2025-05-07T20:33:11.2595488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.2596906Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98569cda80>} 2025-05-07T20:33:11.2598317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.2599442Z context = 2025-05-07T20:33:11.2599792Z 2025-05-07T20:33:11.2599964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.2600501Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.2601052Z module_map=module_map) 2025-05-07T20:33:11.2601428Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.2601789Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.2602048Z E ^ 2025-05-07T20:33:11.2602526Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.2603065Z 2025-05-07T20:33:11.2603504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.2604045Z 2025-05-07T20:33:11.2604152Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2604580Z self=, 2025-05-07T20:33:11.2604999Z T=128, 2025-05-07T20:33:11.2605192Z D=7168, 2025-05-07T20:33:11.2605378Z scale_ub=1200.0, 2025-05-07T20:33:11.2605602Z contiguous=True, 2025-05-07T20:33:11.2605820Z compiled=False, 2025-05-07T20:33:11.2606019Z ) 2025-05-07T20:33:11.2606347Z self = 2025-05-07T20:33:11.2606857Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.2607142Z 2025-05-07T20:33:11.2607230Z @given( 2025-05-07T20:33:11.2607455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2607773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2608090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2608421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2608756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2609054Z ) 2025-05-07T20:33:11.2609402Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2609859Z def test_silu_mul_quant( 2025-05-07T20:33:11.2610106Z self, 2025-05-07T20:33:11.2610290Z T: int, 2025-05-07T20:33:11.2610495Z D: int, 2025-05-07T20:33:11.2610716Z scale_ub: Optional[float], 2025-05-07T20:33:11.2610984Z contiguous: bool, 2025-05-07T20:33:11.2611228Z compiled: bool, 2025-05-07T20:33:11.2611449Z ) -> None: 2025-05-07T20:33:11.2611660Z torch.manual_seed(2025) 2025-05-07T20:33:11.2611890Z 2025-05-07T20:33:11.2612163Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2612515Z 2025-05-07T20:33:11.2612702Z x_sign = torch.sign(x) 2025-05-07T20:33:11.2612987Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.2615273Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.2617261Z 2025-05-07T20:33:11.2617381Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:11.2617598Z 2025-05-07T20:33:11.2617706Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2618118Z self=, 2025-05-07T20:33:11.2618527Z T=128, 2025-05-07T20:33:11.2618713Z D=5120, 2025-05-07T20:33:11.2618910Z scale_ub=1200.0, 2025-05-07T20:33:11.2619133Z contiguous=True, 2025-05-07T20:33:11.2619356Z compiled=True, 2025-05-07T20:33:11.2619609Z ) 2025-05-07T20:33:11.2619932Z self = 2025-05-07T20:33:11.2620443Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.2620723Z 2025-05-07T20:33:11.2620804Z @given( 2025-05-07T20:33:11.2621121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2621443Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2621755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2622087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2622474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2622767Z ) 2025-05-07T20:33:11.2623123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2623582Z def test_silu_mul_quant( 2025-05-07T20:33:11.2623834Z self, 2025-05-07T20:33:11.2624031Z T: int, 2025-05-07T20:33:11.2624241Z D: int, 2025-05-07T20:33:11.2624469Z scale_ub: Optional[float], 2025-05-07T20:33:11.2624746Z contiguous: bool, 2025-05-07T20:33:11.2624991Z compiled: bool, 2025-05-07T20:33:11.2625211Z ) -> None: 2025-05-07T20:33:11.2625629Z torch.manual_seed(2025) 2025-05-07T20:33:11.2625910Z 2025-05-07T20:33:11.2626199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2626549Z 2025-05-07T20:33:11.2626745Z x_sign = torch.sign(x) 2025-05-07T20:33:11.2627039Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.2629208Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
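Note how the failure point has drifted from activation_test.py:92 (the initial torch.randn) to :95 (the torch.clamp), while free memory has shrunk from 26.44 MiB to 4.44 MiB: allocations from earlier examples are still alive as Hypothesis keeps generating new ones. A hedged mitigation sketch, not taken from the test file, that releases cached blocks between examples at some cost in speed:

    import gc
    import torch

    def free_cuda() -> None:
        gc.collect()              # drop dead tensors still referenced by Python
        torch.cuda.empty_cache()  # return cached blocks to the CUDA driver
        torch.cuda.synchronize()  # make sure pending frees have completed

Calling something like this from tearDown(), or simply avoiding long-lived references to x/x0/x1, would keep one failing example from starving the next.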
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.2631184Z 2025-05-07T20:33:11.2631303Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:11.2631518Z 2025-05-07T20:33:11.2631622Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2632039Z self=, 2025-05-07T20:33:11.2632456Z T=128, 2025-05-07T20:33:11.2632645Z D=7168, 2025-05-07T20:33:11.2632840Z scale_ub=None, 2025-05-07T20:33:11.2633050Z contiguous=True, 2025-05-07T20:33:11.2633272Z compiled=True, 2025-05-07T20:33:11.2633470Z ) 2025-05-07T20:33:11.7800991Z self = 2025-05-07T20:33:11.7801884Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.7802309Z 2025-05-07T20:33:11.7809049Z @given( 2025-05-07T20:33:11.7809425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.7809751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.7810062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.7810392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.7810729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.7811014Z ) 2025-05-07T20:33:11.7811377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.7811829Z def test_silu_mul_quant( 2025-05-07T20:33:11.7812075Z self, 2025-05-07T20:33:11.7812272Z T: int, 2025-05-07T20:33:11.7812477Z D: int, 2025-05-07T20:33:11.7812699Z scale_ub: Optional[float], 2025-05-07T20:33:11.7812970Z contiguous: bool, 2025-05-07T20:33:11.7813208Z compiled: bool, 2025-05-07T20:33:11.7813438Z ) -> None: 2025-05-07T20:33:11.7813655Z torch.manual_seed(2025) 2025-05-07T20:33:11.7814025Z 2025-05-07T20:33:11.7814307Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7816725Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
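For orientation, the property under test (its reference path appears in full further down this log) is that the fused Triton kernel agrees with an unfused FP32 computation: y = SiLU(x0) * x1 = x0 * sigmoid(x0) * x1, row-quantized so that y_fp8.to(torch.float32) * y_scale[:, None] ≈ y. A minimal pure-PyTorch sketch of such a reference, assuming torch.float8_e4m3fn is available, with 448.0 as the E4M3 max and a hypothetical clamp guarding against all-zero rows (scale_ub handling omitted):

    import torch

    def silu_mul_quant_ref(x0: torch.Tensor, x1: torch.Tensor, fp8_max: float = 448.0):
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()   # SiLU(x0) * x1
        scale = (y.abs().amax(dim=1) / fp8_max).clamp(min=1e-12)  # per-row scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The actual test instead routes the reference through triton_quantize_fp8_row, which is why the reference path hits the same fp8e4nv CompilationError as the kernel under test.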
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.7818780Z 2025-05-07T20:33:11.7818912Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.7819130Z 2025-05-07T20:33:11.7848714Z FAILED 2025-05-07T20:33:11.7849105Z 2025-05-07T20:33:11.7849495Z =================================== FAILURES =================================== 2025-05-07T20:33:11.7850045Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:11.7850690Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:11.7851475Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:11.7852118Z | yield 2025-05-07T20:33:11.7852646Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:11.7853319Z | self._callTestMethod(testMethod) 2025-05-07T20:33:11.7854104Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:11.7854943Z | if method() is not None: 2025-05-07T20:33:11.7855217Z | ^^^^^^^^ 2025-05-07T20:33:11.7856134Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:11.7857216Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.7857650Z | ^^^^^^^ 2025-05-07T20:33:11.7858467Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:11.7859397Z | raise the_error_hypothesis_found 2025-05-07T20:33:11.7860017Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:11.7860625Z +-+---------------- 1 ---------------- 2025-05-07T20:33:11.7861032Z | Traceback (most recent call last): 2025-05-07T20:33:11.7862060Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:11.7863190Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7863714Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7866653Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
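Hypothesis wraps the four distinct failures in a PEP 654 ExceptionGroup, which is why the summary opens with "+ Exception Group Traceback". A sketch (Python 3.11+) of how a caller could split capacity problems from compilation problems when triaging; run_suite() is a hypothetical stand-in for whatever raised the group:

    import torch

    try:
        run_suite()  # hypothetical: the call that raised the ExceptionGroup above
    except* torch.OutOfMemoryError as eg:
        print(f"{len(eg.exceptions)} OOM failures")    # capacity / leak problem
    except* Exception as eg:
        print(f"{len(eg.exceptions)} other failures")  # e.g. CompilationError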
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.7869653Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:11.7870294Z | self=, 2025-05-07T20:33:11.7870888Z | T=2048, 2025-05-07T20:33:11.7871212Z | D=5120, # or any other generated value 2025-05-07T20:33:11.7871693Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:11.7872427Z | contiguous=True, # or any other generated value 2025-05-07T20:33:11.7872956Z | compiled=False, # or any other generated value 2025-05-07T20:33:11.7873381Z | ) 2025-05-07T20:33:11.7873639Z | 2025-05-07T20:33:11.7874605Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:11.7875481Z +---------------- 2 ---------------- 2025-05-07T20:33:11.7875900Z | Traceback (most recent call last): 2025-05-07T20:33:11.7876928Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:11.7878165Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7878689Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7881651Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.7884530Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:11.7885140Z | self=, 2025-05-07T20:33:11.7885568Z | T=128, 2025-05-07T20:33:11.7885780Z | D=7168, 2025-05-07T20:33:11.7886000Z | scale_ub=None, 2025-05-07T20:33:11.7886252Z | contiguous=True, 2025-05-07T20:33:11.7886496Z | compiled=True, 2025-05-07T20:33:11.7886732Z | ) 2025-05-07T20:33:11.7886937Z | 2025-05-07T20:33:11.7887484Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:11.7888133Z +---------------- 3 ---------------- 2025-05-07T20:33:11.7888447Z | Traceback (most recent call last): 2025-05-07T20:33:11.7889966Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:11.7890801Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7891206Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7893346Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
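Each falsifying example above comes with a one-line reproduction recipe. A sketch of using it, mirroring the @given strategies printed in this log; the decorator is added temporarily on the test and removed once the bug is fixed, and the version string and blob must match what the log printed for the installed Hypothesis:

    from typing import Optional
    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob printed for failure 1
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_replay(T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool) -> None:
        ...  # the real body lives in moe/activation_test.py::test_silu_mul_quant

This replays exactly the recorded example (here T=2048, D=5120) instead of re-running the whole search.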
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.7895568Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:11.7896023Z | self=, 2025-05-07T20:33:11.7896459Z | T=128, 2025-05-07T20:33:11.7896675Z | D=5120, 2025-05-07T20:33:11.7896889Z | scale_ub=1200.0, 2025-05-07T20:33:11.7897145Z | contiguous=True, 2025-05-07T20:33:11.7897402Z | compiled=True, 2025-05-07T20:33:11.7897637Z | ) 2025-05-07T20:33:11.7897817Z | 2025-05-07T20:33:11.7898364Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:11.7899114Z +---------------- 4 ---------------- 2025-05-07T20:33:11.7899416Z | Traceback (most recent call last): 2025-05-07T20:33:11.7900171Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:11.7901021Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:11.7901319Z | ^^^^^^^^ 2025-05-07T20:33:11.7901999Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:11.7902785Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.7903142Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7903983Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:11.7904839Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.7905494Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:11.7906284Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.7906743Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7907423Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:11.7908252Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.7908746Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7909426Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:11.7910171Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.7910563Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7911197Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:11.7911807Z | fn() 2025-05-07T20:33:11.7912415Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:11.7913095Z | self.fn.run( 2025-05-07T20:33:11.7913649Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:11.7914279Z | kernel = self.compile( 2025-05-07T20:33:11.7914561Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:11.7915396Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:11.7916422Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.7916979Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7917915Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:11.7919075Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.7919806Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7920355Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.7920843Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.7921224Z | ^ 2025-05-07T20:33:11.7921911Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.7922816Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:11.7923388Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:11.7924126Z | self=, 2025-05-07T20:33:11.7924849Z | T=1, # or any other generated value 2025-05-07T20:33:11.7925303Z | D=5120, # or any other generated value 2025-05-07T20:33:11.7926084Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:11.7926616Z | contiguous=True, # or any other generated value 2025-05-07T20:33:11.7927326Z | compiled=True, # or any other generated value 2025-05-07T20:33:11.7927750Z | ) 2025-05-07T20:33:11.7928016Z | 2025-05-07T20:33:11.7928786Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:11.7929718Z +------------------------------------ 2025-05-07T20:33:11.7930223Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:11.7930770Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.7931362Z self=, 2025-05-07T20:33:11.7931939Z T=1, 2025-05-07T20:33:11.7932218Z D=5120, 2025-05-07T20:33:11.7932499Z scale_ub=None, 2025-05-07T20:33:11.7932799Z contiguous=True, 2025-05-07T20:33:11.7933124Z compiled=True, 2025-05-07T20:33:11.7933418Z ) 2025-05-07T20:33:11.7933869Z self = 2025-05-07T20:33:11.7934670Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.7935059Z 2025-05-07T20:33:11.7935174Z @given( 2025-05-07T20:33:11.7935505Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.7935945Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.7936394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.7936875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.7937337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.7937749Z ) 2025-05-07T20:33:11.7938246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.7938899Z def test_silu_mul_quant( 2025-05-07T20:33:11.7939307Z self, 2025-05-07T20:33:11.7939595Z T: int, 2025-05-07T20:33:11.7939886Z D: int, 2025-05-07T20:33:11.7940199Z scale_ub: Optional[float], 2025-05-07T20:33:11.7940595Z contiguous: bool, 2025-05-07T20:33:11.7940957Z compiled: bool, 2025-05-07T20:33:11.7941277Z ) -> None: 2025-05-07T20:33:11.7941588Z torch.manual_seed(2025) 2025-05-07T20:33:11.7941937Z 2025-05-07T20:33:11.7942315Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7942814Z 2025-05-07T20:33:11.7943092Z x_sign = torch.sign(x) 2025-05-07T20:33:11.7943474Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.7943915Z x = x_sign * x_clamp 2025-05-07T20:33:11.7944257Z x0 = x[:, :D] 2025-05-07T20:33:11.7944563Z x1 = x[:, D:] 2025-05-07T20:33:11.7944857Z 2025-05-07T20:33:11.7945130Z if contiguous: 2025-05-07T20:33:11.7945461Z x0 = x0.contiguous() 2025-05-07T20:33:11.7945841Z x1 = x1.contiguous() 2025-05-07T20:33:11.7946191Z 2025-05-07T20:33:11.7946462Z if scale_ub is not None: 2025-05-07T20:33:11.7946865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.7947347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.7947791Z ) 2025-05-07T20:33:11.7948072Z else: 2025-05-07T20:33:11.7948378Z scale_ub_tensor = None 2025-05-07T20:33:11.7948761Z 2025-05-07T20:33:11.7949082Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.7949541Z op = silu_mul_quant 2025-05-07T20:33:11.7949995Z if compiled: 2025-05-07T20:33:11.7950339Z op = torch.compile(op) 2025-05-07T20:33:11.7950753Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.7951154Z 2025-05-07T20:33:11.7951427Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.7951979Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.7952395Z 2025-05-07T20:33:11.7952714Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.7953180Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.7953652Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.7954088Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.7954582Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.7955016Z 2025-05-07T20:33:11.7955307Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:11.7955582Z 2025-05-07T20:33:11.7955724Z moe/activation_test.py:126: 2025-05-07T20:33:11.7956152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.7956640Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.7957101Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.7958273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.7959370Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.7960135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.7961096Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.7962097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.7963151Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.7964190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.7965116Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.7966013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.7966762Z fn() 2025-05-07T20:33:11.7967500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.7968351Z self.fn.run( 2025-05-07T20:33:11.7969011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.7969756Z kernel = self.compile( 2025-05-07T20:33:11.7970504Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.7971400Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.7971964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.7972276Z 2025-05-07T20:33:11.7972544Z self = 2025-05-07T20:33:11.7974006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.7975969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99497ecc20>} 2025-05-07T20:33:11.7977814Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.7979315Z context = 2025-05-07T20:33:11.7979699Z 2025-05-07T20:33:11.7979916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.7980617Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.7981331Z module_map=module_map) 2025-05-07T20:33:11.7981816Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.7982284Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.7982714Z E ^ 2025-05-07T20:33:11.7983344Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.7983977Z 2025-05-07T20:33:11.7984546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.7985288Z 2025-05-07T20:33:11.7985431Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.7986019Z self=, 2025-05-07T20:33:11.7986608Z T=2048, 2025-05-07T20:33:11.7986867Z D=5120, 2025-05-07T20:33:11.7987141Z scale_ub=1200.0, 2025-05-07T20:33:11.7987460Z contiguous=True, 2025-05-07T20:33:11.7987779Z compiled=False, 2025-05-07T20:33:11.7988076Z ) 2025-05-07T20:33:11.7988516Z self = 2025-05-07T20:33:11.7989251Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.7989648Z 2025-05-07T20:33:11.7989750Z @given( 2025-05-07T20:33:11.7990062Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.7990490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.7990893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.7991345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.7991798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.7992197Z ) 2025-05-07T20:33:11.7992677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.7993296Z def test_silu_mul_quant( 2025-05-07T20:33:11.7993625Z self, 2025-05-07T20:33:11.7993888Z T: int, 2025-05-07T20:33:11.7994170Z D: int, 2025-05-07T20:33:11.7994459Z scale_ub: Optional[float], 2025-05-07T20:33:11.7994837Z contiguous: bool, 2025-05-07T20:33:11.7995169Z compiled: bool, 2025-05-07T20:33:11.7995470Z ) -> None: 2025-05-07T20:33:11.7995763Z torch.manual_seed(2025) 2025-05-07T20:33:11.7996086Z 2025-05-07T20:33:11.7996441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7996908Z 2025-05-07T20:33:11.7997164Z x_sign = torch.sign(x) 2025-05-07T20:33:11.7997552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.7997961Z x = x_sign * x_clamp 2025-05-07T20:33:11.7998294Z x0 = x[:, :D] 
2025-05-07T20:33:11.7998587Z x1 = x[:, D:] 2025-05-07T20:33:11.7998858Z 2025-05-07T20:33:11.7999106Z if contiguous: 2025-05-07T20:33:11.7999418Z x0 = x0.contiguous() 2025-05-07T20:33:11.7999761Z x1 = x1.contiguous() 2025-05-07T20:33:11.8000084Z 2025-05-07T20:33:11.8000358Z if scale_ub is not None: 2025-05-07T20:33:11.8000733Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8001188Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8001615Z ) 2025-05-07T20:33:11.8001886Z else: 2025-05-07T20:33:11.8002189Z scale_ub_tensor = None 2025-05-07T20:33:11.8002560Z 2025-05-07T20:33:11.8002881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8004864Z op = silu_mul_quant 2025-05-07T20:33:11.8005240Z if compiled: 2025-05-07T20:33:11.8005595Z op = torch.compile(op) 2025-05-07T20:33:11.8006066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8006475Z 2025-05-07T20:33:11.8006739Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8006965Z 2025-05-07T20:33:11.8007115Z moe/activation_test.py:117: 2025-05-07T20:33:11.8007620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8008088Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8008474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8009444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8010448Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8011186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8012112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8013025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8013789Z kernel = self.compile( 2025-05-07T20:33:11.8014701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8015675Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8016233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8016559Z 2025-05-07T20:33:11.8016853Z self = 2025-05-07T20:33:11.8018383Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8020280Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99498a8180>} 2025-05-07T20:33:11.8022233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8023755Z context = 2025-05-07T20:33:11.8024172Z 2025-05-07T20:33:11.8024418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8025173Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8026096Z module_map=module_map) 2025-05-07T20:33:11.8026627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8027125Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8027512Z E ^ 2025-05-07T20:33:11.8028200Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8028827Z 2025-05-07T20:33:11.8029451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8030160Z 2025-05-07T20:33:11.8030305Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8030883Z self=, 2025-05-07T20:33:11.8031431Z T=2048, 2025-05-07T20:33:11.8031679Z D=5120, 2025-05-07T20:33:11.8052866Z scale_ub=1200.0, 2025-05-07T20:33:11.8053194Z contiguous=True, 2025-05-07T20:33:11.8053478Z compiled=True, 2025-05-07T20:33:11.8053731Z ) 2025-05-07T20:33:11.8054166Z self = 2025-05-07T20:33:11.8055029Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.8055414Z 2025-05-07T20:33:11.8055531Z @given( 2025-05-07T20:33:11.8056089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8056545Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8056975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8057437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8058133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8058540Z ) 2025-05-07T20:33:11.8059029Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8059679Z def test_silu_mul_quant( 2025-05-07T20:33:11.8060114Z self, 2025-05-07T20:33:11.8060399Z T: int, 2025-05-07T20:33:11.8060686Z D: int, 2025-05-07T20:33:11.8060987Z scale_ub: Optional[float], 2025-05-07T20:33:11.8061393Z contiguous: bool, 2025-05-07T20:33:11.8061753Z compiled: bool, 2025-05-07T20:33:11.8062066Z ) -> None: 2025-05-07T20:33:11.8062362Z torch.manual_seed(2025) 2025-05-07T20:33:11.8062703Z 2025-05-07T20:33:11.8063066Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8063550Z 2025-05-07T20:33:11.8063820Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8064211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8064630Z x = x_sign * x_clamp 2025-05-07T20:33:11.8064972Z x0 = x[:, :D] 2025-05-07T20:33:11.8065266Z x1 = x[:, D:] 2025-05-07T20:33:11.8065540Z 2025-05-07T20:33:11.8065798Z if contiguous: 2025-05-07T20:33:11.8066103Z x0 = x0.contiguous() 2025-05-07T20:33:11.8066437Z x1 = x1.contiguous() 2025-05-07T20:33:11.8066757Z 2025-05-07T20:33:11.8067003Z if scale_ub is not None: 2025-05-07T20:33:11.8067353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8067800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8068217Z ) 2025-05-07T20:33:11.8068458Z else: 2025-05-07T20:33:11.8068731Z scale_ub_tensor = None 2025-05-07T20:33:11.8069089Z 2025-05-07T20:33:11.8069418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8069854Z op = silu_mul_quant 2025-05-07T20:33:11.8070199Z if compiled: 2025-05-07T20:33:11.8070543Z op = torch.compile(op) 2025-05-07T20:33:11.8070953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8071342Z 2025-05-07T20:33:11.8071608Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8071990Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8072381Z 2025-05-07T20:33:11.8072679Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8073015Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8073314Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8073640Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8074001Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8074314Z 2025-05-07T20:33:11.8074511Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:11.8074708Z 2025-05-07T20:33:11.8074815Z moe/activation_test.py:126: 2025-05-07T20:33:11.8075103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8075450Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8075782Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8076598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8077392Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8077957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8078677Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8079383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8080214Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8081058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8081739Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8082361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8082950Z fn() 2025-05-07T20:33:11.8083483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8084088Z self.fn.run( 2025-05-07T20:33:11.8084562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8085121Z kernel = self.compile( 2025-05-07T20:33:11.8085676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8086363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8086773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8087012Z 2025-05-07T20:33:11.8087225Z self = 2025-05-07T20:33:11.8088343Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8089839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9948439580>} 2025-05-07T20:33:11.8091246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8092332Z context = 2025-05-07T20:33:11.8092628Z 2025-05-07T20:33:11.8092804Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8093345Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8093826Z module_map=module_map) 2025-05-07T20:33:11.8094196Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8094691Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8094966Z E ^ 2025-05-07T20:33:11.8095442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8095913Z 2025-05-07T20:33:11.8096363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8096908Z 2025-05-07T20:33:11.8097012Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8097439Z self=, 2025-05-07T20:33:11.8097861Z T=16384, 2025-05-07T20:33:11.8098052Z D=7168, 2025-05-07T20:33:11.8098257Z scale_ub=1200.0, 2025-05-07T20:33:11.8098486Z contiguous=False, 2025-05-07T20:33:11.8098712Z compiled=False, 2025-05-07T20:33:11.8098929Z ) 2025-05-07T20:33:11.8099292Z self = 2025-05-07T20:33:11.8099814Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:11.8100104Z 2025-05-07T20:33:11.8100183Z @given( 2025-05-07T20:33:11.8100415Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8100731Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8101086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8101418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8101747Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8102038Z ) 2025-05-07T20:33:11.8102466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8102925Z def test_silu_mul_quant( 2025-05-07T20:33:11.8103174Z self, 2025-05-07T20:33:11.8103367Z T: int, 2025-05-07T20:33:11.8103562Z D: int, 2025-05-07T20:33:11.8103811Z scale_ub: Optional[float], 2025-05-07T20:33:11.8104074Z contiguous: bool, 2025-05-07T20:33:11.8104312Z compiled: bool, 2025-05-07T20:33:11.8104531Z ) -> None: 2025-05-07T20:33:11.8104740Z torch.manual_seed(2025) 2025-05-07T20:33:11.8104978Z 2025-05-07T20:33:11.8105257Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8105611Z 2025-05-07T20:33:11.8105819Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8106117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8106434Z x = x_sign * x_clamp 2025-05-07T20:33:11.8106686Z x0 = x[:, :D] 2025-05-07T20:33:11.8106906Z x1 = x[:, D:] 2025-05-07T20:33:11.8107108Z 2025-05-07T20:33:11.8107301Z if contiguous: 2025-05-07T20:33:11.8107541Z x0 = x0.contiguous() 2025-05-07T20:33:11.8107802Z x1 = x1.contiguous() 2025-05-07T20:33:11.8108040Z 2025-05-07T20:33:11.8108239Z if scale_ub is not None: 2025-05-07T20:33:11.8108529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8108861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8109180Z ) 2025-05-07T20:33:11.8109368Z else: 2025-05-07T20:33:11.8109569Z scale_ub_tensor = None 2025-05-07T20:33:11.8109828Z 2025-05-07T20:33:11.8110066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8110389Z op = silu_mul_quant 2025-05-07T20:33:11.8110649Z if compiled: 2025-05-07T20:33:11.8110904Z op = torch.compile(op) 2025-05-07T20:33:11.8111202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8111491Z 2025-05-07T20:33:11.8111694Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8111862Z 2025-05-07T20:33:11.8111973Z moe/activation_test.py:117: 2025-05-07T20:33:11.8112276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8112622Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8112915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8113635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:11.8114366Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8114931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8115655Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8116347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8116914Z kernel = self.compile( 2025-05-07T20:33:11.8117490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8118176Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8118595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8118840Z 2025-05-07T20:33:11.8119050Z self = 2025-05-07T20:33:11.8120181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8121658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9948439c60>} 2025-05-07T20:33:11.8123149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8124244Z context = 2025-05-07T20:33:11.8124580Z 2025-05-07T20:33:11.8124761Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8125296Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8126134Z module_map=module_map) 2025-05-07T20:33:11.8126516Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8126891Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8127153Z E ^ 2025-05-07T20:33:11.8127635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8128105Z 2025-05-07T20:33:11.8128556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8129101Z 2025-05-07T20:33:11.8129214Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8129632Z self=, 2025-05-07T20:33:11.8130066Z T=1, 2025-05-07T20:33:11.8130260Z D=7168, 2025-05-07T20:33:11.8130454Z scale_ub=None, 2025-05-07T20:33:11.8130679Z contiguous=True, 2025-05-07T20:33:11.8130907Z compiled=True, 2025-05-07T20:33:11.8131102Z ) 2025-05-07T20:33:11.8131431Z self = 2025-05-07T20:33:11.8131939Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.8132209Z 2025-05-07T20:33:11.8132286Z @given( 2025-05-07T20:33:11.8132515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8132842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8133160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8133489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8133828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8134130Z ) 2025-05-07T20:33:11.8134552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8135013Z def test_silu_mul_quant( 2025-05-07T20:33:11.8135251Z self, 2025-05-07T20:33:11.8135448Z T: int, 2025-05-07T20:33:11.8135648Z D: int, 2025-05-07T20:33:11.8135860Z scale_ub: Optional[float], 2025-05-07T20:33:11.8136137Z contiguous: bool, 2025-05-07T20:33:11.8136379Z compiled: bool, 2025-05-07T20:33:11.8136593Z ) -> None: 2025-05-07T20:33:11.8136812Z torch.manual_seed(2025) 2025-05-07T20:33:11.8137053Z 2025-05-07T20:33:11.8137324Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8137679Z 2025-05-07T20:33:11.8137877Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8138163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8138489Z x = x_sign * x_clamp 2025-05-07T20:33:11.8138739Z x0 = x[:, :D] 2025-05-07T20:33:11.8138972Z x1 = x[:, D:] 2025-05-07T20:33:11.8139222Z 2025-05-07T20:33:11.8139413Z if contiguous: 2025-05-07T20:33:11.8139651Z x0 = x0.contiguous() 2025-05-07T20:33:11.8139913Z x1 = x1.contiguous() 2025-05-07T20:33:11.8140164Z 2025-05-07T20:33:11.8140361Z if scale_ub is not None: 2025-05-07T20:33:11.8140636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8141101Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8141419Z ) 2025-05-07T20:33:11.8141612Z else: 2025-05-07T20:33:11.8141821Z scale_ub_tensor = None 2025-05-07T20:33:11.8142075Z 2025-05-07T20:33:11.8142452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8142778Z op = silu_mul_quant 2025-05-07T20:33:11.8143030Z if compiled: 2025-05-07T20:33:11.8143273Z op = torch.compile(op) 2025-05-07T20:33:11.8143573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8143917Z 2025-05-07T20:33:11.8144103Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8144395Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8144694Z 2025-05-07T20:33:11.8144939Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8145282Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8145584Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8145912Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8146273Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8146595Z 2025-05-07T20:33:11.8146795Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8146999Z 2025-05-07T20:33:11.8147103Z moe/activation_test.py:126: 2025-05-07T20:33:11.8147410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8147751Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8148087Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8148918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8149748Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8150322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8151037Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8151764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8152531Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8153304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8153969Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8154606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8155151Z fn() 2025-05-07T20:33:11.8155688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8156295Z self.fn.run( 2025-05-07T20:33:11.8156785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8157344Z kernel = self.compile( 2025-05-07T20:33:11.8157900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8158599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8159040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8159300Z 2025-05-07T20:33:11.8159515Z self = 2025-05-07T20:33:11.8160635Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8162064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994843ad40>} 2025-05-07T20:33:11.8163630Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8164719Z context = 2025-05-07T20:33:11.8165017Z 2025-05-07T20:33:11.8165194Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8165768Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8166254Z module_map=module_map) 2025-05-07T20:33:11.8166625Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8166984Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8167263Z E ^ 2025-05-07T20:33:11.8167740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8168215Z 2025-05-07T20:33:11.8168657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8169201Z 2025-05-07T20:33:11.8169308Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8169736Z self=, 2025-05-07T20:33:11.8170154Z T=4096, 2025-05-07T20:33:11.8170334Z D=5120, 2025-05-07T20:33:11.8170532Z scale_ub=None, 2025-05-07T20:33:11.8170749Z contiguous=False, 2025-05-07T20:33:11.8170970Z compiled=False, 2025-05-07T20:33:11.8171175Z ) 2025-05-07T20:33:11.8171498Z self = 2025-05-07T20:33:11.8172002Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.8172296Z 2025-05-07T20:33:11.8172377Z @given( 2025-05-07T20:33:11.8172606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8172922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8173233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8173569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8173908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8174193Z ) 2025-05-07T20:33:11.8174619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8175076Z def test_silu_mul_quant( 2025-05-07T20:33:11.8175321Z self, 2025-05-07T20:33:11.8175521Z T: int, 2025-05-07T20:33:11.8175721Z D: int, 2025-05-07T20:33:11.8175933Z scale_ub: Optional[float], 2025-05-07T20:33:11.8176211Z contiguous: bool, 2025-05-07T20:33:11.8176451Z compiled: bool, 2025-05-07T20:33:11.8176676Z ) -> None: 2025-05-07T20:33:11.8176885Z torch.manual_seed(2025) 2025-05-07T20:33:11.8177132Z 2025-05-07T20:33:11.8177411Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8177762Z 2025-05-07T20:33:11.8177958Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8178248Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8178561Z x = x_sign * x_clamp 2025-05-07T20:33:11.8178802Z x0 = x[:, :D] 2025-05-07T20:33:11.8179015Z x1 = x[:, D:] 2025-05-07T20:33:11.8179215Z 2025-05-07T20:33:11.8179398Z if contiguous: 2025-05-07T20:33:11.8179629Z x0 = x0.contiguous() 2025-05-07T20:33:11.8179885Z x1 = x1.contiguous() 2025-05-07T20:33:11.8180135Z 2025-05-07T20:33:11.8180330Z if scale_ub is not None: 2025-05-07T20:33:11.8180601Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8180940Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8181254Z ) 2025-05-07T20:33:11.8181511Z else: 2025-05-07T20:33:11.8181718Z scale_ub_tensor = None 2025-05-07T20:33:11.8181975Z 2025-05-07T20:33:11.8182208Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8182524Z op = silu_mul_quant 2025-05-07T20:33:11.8182771Z if compiled: 2025-05-07T20:33:11.8183095Z op = torch.compile(op) 2025-05-07T20:33:11.8183394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8183671Z 2025-05-07T20:33:11.8183864Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8184027Z 2025-05-07T20:33:11.8184165Z moe/activation_test.py:117: 2025-05-07T20:33:11.8184462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8184803Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8185083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8185799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8186528Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8187090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8187798Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8188498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8189066Z kernel = self.compile( 2025-05-07T20:33:11.8189672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8190354Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8190760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8190993Z 2025-05-07T20:33:11.8191209Z self = 2025-05-07T20:33:11.8192333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8193753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994386c4a0>} 2025-05-07T20:33:11.8195160Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8196249Z context = 2025-05-07T20:33:11.8196549Z 2025-05-07T20:33:11.8196726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8197259Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8197745Z module_map=module_map) 2025-05-07T20:33:11.8198116Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8198479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8198738Z E ^ 2025-05-07T20:33:11.8199222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8199691Z 2025-05-07T20:33:11.8200131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8200674Z 2025-05-07T20:33:11.8200778Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8201204Z self=, 2025-05-07T20:33:11.8201616Z T=4096, 2025-05-07T20:33:11.8201806Z D=7168, 2025-05-07T20:33:11.8201991Z scale_ub=None, 2025-05-07T20:33:11.8202205Z contiguous=False, 2025-05-07T20:33:11.8202482Z compiled=False, 2025-05-07T20:33:11.8202679Z ) 2025-05-07T20:33:11.8202999Z self = 2025-05-07T20:33:11.8203509Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.8203791Z 2025-05-07T20:33:11.8203943Z @given( 2025-05-07T20:33:11.8204174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8204490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8204792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8205165Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8205499Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8205785Z ) 2025-05-07T20:33:11.8206128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8206580Z def test_silu_mul_quant( 2025-05-07T20:33:11.8206825Z self, 2025-05-07T20:33:11.8207017Z T: int, 2025-05-07T20:33:11.8207213Z D: int, 2025-05-07T20:33:11.8207432Z scale_ub: Optional[float], 2025-05-07T20:33:11.8207703Z contiguous: bool, 2025-05-07T20:33:11.8207943Z compiled: bool, 2025-05-07T20:33:11.8208169Z ) -> None: 2025-05-07T20:33:11.8208384Z torch.manual_seed(2025) 2025-05-07T20:33:11.8217044Z 2025-05-07T20:33:11.8217364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8217730Z 2025-05-07T20:33:11.8217935Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8218244Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8218577Z x = x_sign * x_clamp 2025-05-07T20:33:11.8218830Z x0 = x[:, :D] 2025-05-07T20:33:11.8219050Z x1 = x[:, D:] 2025-05-07T20:33:11.8219302Z 2025-05-07T20:33:11.8219516Z if contiguous: 2025-05-07T20:33:11.8219746Z x0 = x0.contiguous() 2025-05-07T20:33:11.8220012Z x1 = x1.contiguous() 2025-05-07T20:33:11.8220266Z 2025-05-07T20:33:11.8220457Z if scale_ub is not None: 2025-05-07T20:33:11.8220738Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8221080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8221395Z ) 2025-05-07T20:33:11.8221601Z else: 2025-05-07T20:33:11.8221820Z scale_ub_tensor = None 2025-05-07T20:33:11.8222083Z 2025-05-07T20:33:11.8222314Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8222643Z op = silu_mul_quant 2025-05-07T20:33:11.8222901Z if compiled: 2025-05-07T20:33:11.8223147Z op = torch.compile(op) 2025-05-07T20:33:11.8223455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8223741Z 2025-05-07T20:33:11.8223936Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8224109Z 2025-05-07T20:33:11.8224211Z moe/activation_test.py:117: 2025-05-07T20:33:11.8224516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8224856Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8225145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8226237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8226984Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8227545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8228277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8228987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8229550Z kernel = self.compile( 2025-05-07T20:33:11.8230126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8231005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8231432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8231676Z 2025-05-07T20:33:11.8231890Z self = 2025-05-07T20:33:11.8233160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8234663Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994386df80>} 2025-05-07T20:33:11.8236085Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8237183Z context = 2025-05-07T20:33:11.8237485Z 2025-05-07T20:33:11.8237659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8238214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8238710Z module_map=module_map) 2025-05-07T20:33:11.8239125Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8239503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8239781Z E ^ 2025-05-07T20:33:11.8240269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8240746Z 2025-05-07T20:33:11.8241188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8241744Z 2025-05-07T20:33:11.8241855Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8242287Z self=, 2025-05-07T20:33:11.8242709Z T=128, 2025-05-07T20:33:11.8242897Z D=7168, 2025-05-07T20:33:11.8243102Z scale_ub=None, 2025-05-07T20:33:11.8243338Z contiguous=False, 2025-05-07T20:33:11.8243564Z compiled=True, 2025-05-07T20:33:11.8243773Z ) 2025-05-07T20:33:11.8244104Z self = 2025-05-07T20:33:11.8244611Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8244903Z 2025-05-07T20:33:11.8244986Z @given( 2025-05-07T20:33:11.8245223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8245549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8245873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8246224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8246584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8246877Z ) 2025-05-07T20:33:11.8247243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8247711Z def test_silu_mul_quant( 2025-05-07T20:33:11.8247962Z self, 2025-05-07T20:33:11.8248172Z T: int, 2025-05-07T20:33:11.8248376Z D: int, 2025-05-07T20:33:11.8248595Z scale_ub: Optional[float], 2025-05-07T20:33:11.8248878Z contiguous: bool, 2025-05-07T20:33:11.8249126Z compiled: bool, 2025-05-07T20:33:11.8249355Z ) -> None: 2025-05-07T20:33:11.8249578Z torch.manual_seed(2025) 2025-05-07T20:33:11.8249830Z 2025-05-07T20:33:11.8250114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8250473Z 2025-05-07T20:33:11.8250677Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8250970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8251355Z x = x_sign * x_clamp 2025-05-07T20:33:11.8251613Z x0 = x[:, :D] 2025-05-07T20:33:11.8251844Z x1 = x[:, D:] 2025-05-07T20:33:11.8252061Z 2025-05-07T20:33:11.8252265Z if contiguous: 2025-05-07T20:33:11.8252513Z x0 = x0.contiguous() 2025-05-07T20:33:11.8252859Z x1 = x1.contiguous() 2025-05-07T20:33:11.8253124Z 2025-05-07T20:33:11.8253333Z if scale_ub is not None: 2025-05-07T20:33:11.8253613Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8253967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8254419Z ) 2025-05-07T20:33:11.8254622Z else: 2025-05-07T20:33:11.8254847Z scale_ub_tensor = None 2025-05-07T20:33:11.8255110Z 2025-05-07T20:33:11.8255345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8255675Z op = silu_mul_quant 2025-05-07T20:33:11.8255941Z if compiled: 2025-05-07T20:33:11.8256194Z op = torch.compile(op) 2025-05-07T20:33:11.8256505Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8256791Z 2025-05-07T20:33:11.8256985Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8257278Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8257581Z 2025-05-07T20:33:11.8257835Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8258175Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8258481Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8258813Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8259233Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8259560Z 2025-05-07T20:33:11.8259772Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8259973Z 2025-05-07T20:33:11.8260076Z moe/activation_test.py:126: 2025-05-07T20:33:11.8260380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8260733Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8261070Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8261895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8262697Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8263273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8263990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8264724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8265492Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8266273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8266950Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8267588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8268140Z fn() 2025-05-07T20:33:11.8268685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8269302Z self.fn.run( 2025-05-07T20:33:11.8269797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8270368Z kernel = self.compile( 2025-05-07T20:33:11.8270933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8271632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8272053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8272350Z 2025-05-07T20:33:11.8272573Z self = 2025-05-07T20:33:11.8273779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8275225Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994386ec00>} 2025-05-07T20:33:11.8276690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8277785Z context = 2025-05-07T20:33:11.8277794Z 2025-05-07T20:33:11.8277965Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8278247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8278359Z module_map=module_map) 2025-05-07T20:33:11.8278531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8278644Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8278724Z E ^ 2025-05-07T20:33:11.8279097Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8279105Z 2025-05-07T20:33:11.8279552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8279557Z 2025-05-07T20:33:11.8279664Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8279905Z self=, 2025-05-07T20:33:11.8279995Z T=128, 2025-05-07T20:33:11.8280076Z D=7168, 2025-05-07T20:33:11.8280169Z scale_ub=None, 2025-05-07T20:33:11.8280261Z contiguous=False, 2025-05-07T20:33:11.8280350Z compiled=False, 2025-05-07T20:33:11.8280433Z ) 2025-05-07T20:33:11.8280665Z self = 2025-05-07T20:33:11.8280854Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.8280859Z 2025-05-07T20:33:11.8280940Z @given( 2025-05-07T20:33:11.8281062Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8281171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8281288Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8281406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8281527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8281604Z ) 2025-05-07T20:33:11.8281860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8281967Z def test_silu_mul_quant( 2025-05-07T20:33:11.8282046Z self, 2025-05-07T20:33:11.8282129Z T: int, 2025-05-07T20:33:11.8282212Z D: int, 2025-05-07T20:33:11.8282312Z scale_ub: Optional[float], 2025-05-07T20:33:11.8282415Z contiguous: bool, 2025-05-07T20:33:11.8282503Z compiled: bool, 2025-05-07T20:33:11.8282582Z ) -> None: 2025-05-07T20:33:11.8282687Z torch.manual_seed(2025) 2025-05-07T20:33:11.8282763Z 2025-05-07T20:33:11.8282943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8283028Z 2025-05-07T20:33:11.8283122Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8283250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8283349Z x = x_sign * x_clamp 2025-05-07T20:33:11.8283435Z x0 = x[:, :D] 2025-05-07T20:33:11.8283521Z x1 = x[:, D:] 2025-05-07T20:33:11.8283654Z 2025-05-07T20:33:11.8283747Z if contiguous: 2025-05-07T20:33:11.8283854Z x0 = x0.contiguous() 2025-05-07T20:33:11.8283947Z x1 = x1.contiguous() 2025-05-07T20:33:11.8284027Z 2025-05-07T20:33:11.8284126Z if scale_ub is not None: 2025-05-07T20:33:11.8284345Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8284487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8284576Z ) 2025-05-07T20:33:11.8284658Z else: 2025-05-07T20:33:11.8284756Z scale_ub_tensor = None 2025-05-07T20:33:11.8284905Z 2025-05-07T20:33:11.8285037Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8285129Z op = silu_mul_quant 2025-05-07T20:33:11.8285221Z if compiled: 2025-05-07T20:33:11.8285323Z op = torch.compile(op) 2025-05-07T20:33:11.8285436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8285510Z 2025-05-07T20:33:11.8285608Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8285612Z 2025-05-07T20:33:11.8285718Z moe/activation_test.py:117: 2025-05-07T20:33:11.8285853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8285957Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8286072Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8286601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8286707Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8287093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8287324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8287692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8287795Z kernel = self.compile( 2025-05-07T20:33:11.8288200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8288388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8288526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8288531Z 2025-05-07T20:33:11.8288748Z self = 2025-05-07T20:33:11.8289565Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8290086Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9943d379c0>} 2025-05-07T20:33:11.8290889Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8291086Z context = 2025-05-07T20:33:11.8291095Z 2025-05-07T20:33:11.8291274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8291552Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8291664Z module_map=module_map) 2025-05-07T20:33:11.8291843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8291948Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8292027Z E ^ 2025-05-07T20:33:11.8292404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8292458Z 2025-05-07T20:33:11.8292897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8292902Z 2025-05-07T20:33:11.8293013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8293317Z self=, 2025-05-07T20:33:11.8293404Z T=4096, 2025-05-07T20:33:11.8293504Z D=5120, 2025-05-07T20:33:11.8293595Z scale_ub=1200.0, 2025-05-07T20:33:11.8293695Z contiguous=True, 2025-05-07T20:33:11.8293785Z compiled=False, 2025-05-07T20:33:11.8293904Z ) 2025-05-07T20:33:11.8294141Z self = 2025-05-07T20:33:11.8294322Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.8294327Z 2025-05-07T20:33:11.8294495Z @given( 2025-05-07T20:33:11.8294619Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8294724Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8294858Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8294982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8295099Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8295188Z ) 2025-05-07T20:33:11.8295451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8295557Z def test_silu_mul_quant( 2025-05-07T20:33:11.8295639Z self, 2025-05-07T20:33:11.8295722Z T: int, 2025-05-07T20:33:11.8295812Z D: int, 2025-05-07T20:33:11.8295918Z scale_ub: Optional[float], 2025-05-07T20:33:11.8296014Z contiguous: bool, 2025-05-07T20:33:11.8296111Z compiled: bool, 2025-05-07T20:33:11.8296192Z ) -> None: 2025-05-07T20:33:11.8296291Z torch.manual_seed(2025) 2025-05-07T20:33:11.8296374Z 2025-05-07T20:33:11.8296547Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8296630Z 2025-05-07T20:33:11.8296733Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8296863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8296958Z x = x_sign * x_clamp 2025-05-07T20:33:11.8297056Z x0 = x[:, :D] 2025-05-07T20:33:11.8297140Z x1 = x[:, D:] 2025-05-07T20:33:11.8297228Z 2025-05-07T20:33:11.8297317Z if contiguous: 2025-05-07T20:33:11.8297416Z x0 = x0.contiguous() 2025-05-07T20:33:11.8297520Z x1 = x1.contiguous() 2025-05-07T20:33:11.8297599Z 2025-05-07T20:33:11.8297697Z if scale_ub is not None: 2025-05-07T20:33:11.8297814Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8297954Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8298037Z ) 2025-05-07T20:33:11.8298130Z else: 2025-05-07T20:33:11.8298231Z scale_ub_tensor = None 2025-05-07T20:33:11.8298310Z 2025-05-07T20:33:11.8298449Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8298548Z op = silu_mul_quant 2025-05-07T20:33:11.8298649Z if compiled: 2025-05-07T20:33:11.8298754Z op = torch.compile(op) 2025-05-07T20:33:11.8298866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8298950Z 2025-05-07T20:33:11.8299050Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8299055Z 2025-05-07T20:33:11.8299156Z moe/activation_test.py:117: 2025-05-07T20:33:11.8299300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8299411Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8299516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8300050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8300149Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8300534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8300816Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8301175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8301349Z kernel = self.compile( 2025-05-07T20:33:11.8301758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8301936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8302115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8302120Z 2025-05-07T20:33:11.8302329Z self = 2025-05-07T20:33:11.8303148Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8303664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99482e2520>} 2025-05-07T20:33:11.8304472Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8304671Z context = 2025-05-07T20:33:11.8304676Z 2025-05-07T20:33:11.8304847Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8305124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8305233Z module_map=module_map) 2025-05-07T20:33:11.8305404Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8305504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8305584Z E ^ 2025-05-07T20:33:11.8305959Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8305968Z 2025-05-07T20:33:11.8306403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8306407Z 2025-05-07T20:33:11.8306520Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8306752Z self=, 2025-05-07T20:33:11.8306833Z T=1, 2025-05-07T20:33:11.8306919Z D=5120, 2025-05-07T20:33:11.8307002Z scale_ub=None, 2025-05-07T20:33:11.8307091Z contiguous=True, 2025-05-07T20:33:11.8307181Z compiled=True, 2025-05-07T20:33:11.8307256Z ) 2025-05-07T20:33:11.8307479Z self = 2025-05-07T20:33:11.8307654Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.8307659Z 2025-05-07T20:33:11.8307738Z @given( 2025-05-07T20:33:11.8307860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8307973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8308091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8308215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8308330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8308407Z ) 2025-05-07T20:33:11.8308669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8308765Z def test_silu_mul_quant( 2025-05-07T20:33:11.8308844Z self, 2025-05-07T20:33:11.8308930Z T: int, 2025-05-07T20:33:11.8309008Z D: int, 2025-05-07T20:33:11.8309110Z scale_ub: Optional[float], 2025-05-07T20:33:11.8309260Z contiguous: bool, 2025-05-07T20:33:11.8309349Z compiled: bool, 2025-05-07T20:33:11.8309429Z ) -> None: 2025-05-07T20:33:11.8309531Z torch.manual_seed(2025) 2025-05-07T20:33:11.8309606Z 2025-05-07T20:33:11.8309785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8309936Z 2025-05-07T20:33:11.8310033Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8310165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8310260Z x = x_sign * x_clamp 2025-05-07T20:33:11.8310347Z x0 = x[:, :D] 2025-05-07T20:33:11.8310479Z x1 = x[:, D:] 2025-05-07T20:33:11.8310558Z 2025-05-07T20:33:11.8310648Z if contiguous: 2025-05-07T20:33:11.8310753Z x0 = x0.contiguous() 2025-05-07T20:33:11.8310849Z x1 = x1.contiguous() 2025-05-07T20:33:11.8310928Z 2025-05-07T20:33:11.8311030Z if scale_ub is not None: 2025-05-07T20:33:11.8311142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8311289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8311372Z ) 2025-05-07T20:33:11.8311456Z else: 2025-05-07T20:33:11.8311564Z scale_ub_tensor = None 2025-05-07T20:33:11.8311644Z 2025-05-07T20:33:11.8311781Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8311883Z op = silu_mul_quant 2025-05-07T20:33:11.8311971Z if compiled: 2025-05-07T20:33:11.8312071Z op = torch.compile(op) 2025-05-07T20:33:11.8312184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8312261Z 2025-05-07T20:33:11.8312354Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8312486Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8312560Z 2025-05-07T20:33:11.8312704Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8312809Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8312917Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8313045Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8313186Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8313261Z 2025-05-07T20:33:11.8313374Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8313378Z 2025-05-07T20:33:11.8313478Z moe/activation_test.py:126: 2025-05-07T20:33:11.8313612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8313726Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8313866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8314464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8314566Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8314946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8315186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8315573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8315851Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8316246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8316417Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8316782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8316861Z fn() 2025-05-07T20:33:11.8317284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8317422Z self.fn.run( 2025-05-07T20:33:11.8317778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8317879Z kernel = self.compile( 2025-05-07T20:33:11.8318378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8318559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8318702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8318744Z 2025-05-07T20:33:11.8318956Z self = 2025-05-07T20:33:11.8319772Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8320285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99484c7d80>} 2025-05-07T20:33:11.8321085Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8321284Z context = 2025-05-07T20:33:11.8321288Z 2025-05-07T20:33:11.8321458Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8321737Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8321846Z module_map=module_map) 2025-05-07T20:33:11.8322009Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8322119Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8322199Z E ^ 2025-05-07T20:33:11.8322574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8322585Z 2025-05-07T20:33:11.8323021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8323030Z 2025-05-07T20:33:11.8323140Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8323374Z self=, 2025-05-07T20:33:11.8323453Z T=2048, 2025-05-07T20:33:11.8323535Z D=5120, 2025-05-07T20:33:11.8323627Z scale_ub=None, 2025-05-07T20:33:11.8323716Z contiguous=True, 2025-05-07T20:33:11.8323801Z compiled=True, 2025-05-07T20:33:11.8323886Z ) 2025-05-07T20:33:11.8324112Z self = 2025-05-07T20:33:11.8324294Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.8324301Z 2025-05-07T20:33:11.8324381Z @given( 2025-05-07T20:33:11.8324506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8324613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8324730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8324853Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8324976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8325052Z ) 2025-05-07T20:33:11.8325311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8325631Z def test_silu_mul_quant( 2025-05-07T20:33:11.8325742Z self, 2025-05-07T20:33:11.8325856Z T: int, 2025-05-07T20:33:11.8325962Z D: int, 2025-05-07T20:33:11.8326084Z scale_ub: Optional[float], 2025-05-07T20:33:11.8326182Z contiguous: bool, 2025-05-07T20:33:11.8326267Z compiled: bool, 2025-05-07T20:33:11.8326345Z ) -> None: 2025-05-07T20:33:11.8326557Z torch.manual_seed(2025) 2025-05-07T20:33:11.8326629Z 2025-05-07T20:33:11.8326801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8326883Z 2025-05-07T20:33:11.8326975Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8327220Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8327328Z x = x_sign * x_clamp 2025-05-07T20:33:11.8327411Z x0 = x[:, :D] 2025-05-07T20:33:11.8327501Z x1 = x[:, D:] 2025-05-07T20:33:11.8327576Z 2025-05-07T20:33:11.8327662Z if contiguous: 2025-05-07T20:33:11.8327825Z x0 = x0.contiguous() 2025-05-07T20:33:11.8327920Z x1 = x1.contiguous() 2025-05-07T20:33:11.8327996Z 2025-05-07T20:33:11.8328098Z if scale_ub is not None: 2025-05-07T20:33:11.8328207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8328346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8328429Z ) 2025-05-07T20:33:11.8328512Z else: 2025-05-07T20:33:11.8328611Z scale_ub_tensor = None 2025-05-07T20:33:11.8328688Z 2025-05-07T20:33:11.8328817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8328914Z op = silu_mul_quant 2025-05-07T20:33:11.8328998Z if compiled: 2025-05-07T20:33:11.8329104Z op = torch.compile(op) 2025-05-07T20:33:11.8329238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8329320Z 2025-05-07T20:33:11.8329428Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8329558Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8329630Z 2025-05-07T20:33:11.8329766Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8329878Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8329980Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8330104Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8330260Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8330336Z 2025-05-07T20:33:11.8330448Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8330452Z 2025-05-07T20:33:11.8330553Z moe/activation_test.py:126: 2025-05-07T20:33:11.8330691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8330805Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8330944Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8331532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8331641Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8332020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8332256Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8332645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8332908Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8333314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8333483Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8333853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8333935Z fn() 2025-05-07T20:33:11.8334430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8334518Z self.fn.run( 2025-05-07T20:33:11.8334873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8335019Z kernel = self.compile( 2025-05-07T20:33:11.8335427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8335603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8335825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8335830Z 2025-05-07T20:33:11.8336042Z self = 2025-05-07T20:33:11.8336853Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8337416Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994384f060>} 2025-05-07T20:33:11.8338207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8338411Z context = 2025-05-07T20:33:11.8338415Z 2025-05-07T20:33:11.8338584Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8338854Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8338970Z module_map=module_map) 2025-05-07T20:33:11.8339156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8339278Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8339365Z E ^ 2025-05-07T20:33:11.8339731Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[9 further Hypothesis examples omitted: each fails with the identical CompilationError from make_ir, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the drawn parameters and the first call to reach the Triton compiler differ:
  T=128,   D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> ref_fn / _kernel_quantize_fp8_row
  T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> ref_fn / _kernel_quantize_fp8_row
  T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> ref_fn / _kernel_quantize_fp8_row
  T=1,     D=5120, scale_ub=1200.0, contiguous=True,  compiled=True  -> fn / _fbgemm_silu_mul_quant
  T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=True  -> ref_fn / _kernel_quantize_fp8_row
  T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=False -> fn / _fbgemm_silu_mul_quant
  T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=True  -> fn / _fbgemm_silu_mul_quant
  T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=False -> fn / _fbgemm_silu_mul_quant
  T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=False -> fn / _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(
self=,
T=128,
D=5120,
scale_ub=1200.0,
contiguous=True,
compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

[test body identical to the listings above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
[... Triton compile traceback identical to the one above, ending in the same error:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError ...]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... test source and failing call identical to the listing above; compiled=True adds one frame, torch/_dynamo/eval_frame.py:678: in _fn, before entering silu_mul_quant, and the run fails with the same CompilationError ...]
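Every example above fails at the same point: fp8e4nv is Triton's name for the FP8 E4M3 format in its NVIDIA backend, and this GPU's architecture cannot compile it; Triton offers only fp8e4b15 and fp8e5 here, which is exactly what the ValueError reports. A minimal sketch of a hardware gate for such tests follows; the compute-capability threshold of (8, 9) (Ada/Hopper) and the skipIf placement are illustrative assumptions, not code from this repository:

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv (FP8 E4M3) codegen needs compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8_e4m3(), "fp8e4nv unsupported on this GPU")
    class SiluMulQuantTest(unittest.TestCase):
        ...  # e.g. the test_silu_mul_quant shown in this log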
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... test source and compile traceback identical to the above, ending in ...]
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8530766Z 2025-05-07T20:33:11.8531203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8531210Z 2025-05-07T20:33:11.8531323Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8531553Z self=, 2025-05-07T20:33:11.8531634Z T=1, 2025-05-07T20:33:11.8531718Z D=7168, 2025-05-07T20:33:11.8531808Z scale_ub=None, 2025-05-07T20:33:11.8531898Z contiguous=False, 2025-05-07T20:33:11.8531992Z compiled=True, 2025-05-07T20:33:11.8532071Z ) 2025-05-07T20:33:11.8532304Z self = 2025-05-07T20:33:11.8532477Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8532482Z 2025-05-07T20:33:11.8532562Z @given( 2025-05-07T20:33:11.8532690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8532793Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8532909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8533036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8533153Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8533225Z ) 2025-05-07T20:33:11.8533490Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8533585Z def test_silu_mul_quant( 2025-05-07T20:33:11.8533676Z self, 2025-05-07T20:33:11.8533758Z T: int, 2025-05-07T20:33:11.8533835Z D: int, 2025-05-07T20:33:11.8533946Z scale_ub: Optional[float], 2025-05-07T20:33:11.8534034Z contiguous: bool, 2025-05-07T20:33:11.8534120Z compiled: bool, 2025-05-07T20:33:11.8534206Z ) -> None: 2025-05-07T20:33:11.8534301Z torch.manual_seed(2025) 2025-05-07T20:33:11.8534513Z 2025-05-07T20:33:11.8534694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8534768Z 2025-05-07T20:33:11.8534860Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8534995Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8535088Z x = x_sign * x_clamp 2025-05-07T20:33:11.8535177Z x0 = x[:, :D] 2025-05-07T20:33:11.8535256Z x1 = x[:, D:] 2025-05-07T20:33:11.8535333Z 2025-05-07T20:33:11.8535428Z if contiguous: 2025-05-07T20:33:11.8535521Z x0 = x0.contiguous() 2025-05-07T20:33:11.8535617Z x1 = x1.contiguous() 2025-05-07T20:33:11.8535697Z 2025-05-07T20:33:11.8535788Z if scale_ub is not None: 2025-05-07T20:33:11.8535893Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8536037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8536113Z ) 2025-05-07T20:33:11.8536191Z else: 2025-05-07T20:33:11.8536296Z scale_ub_tensor = None 2025-05-07T20:33:11.8536368Z 2025-05-07T20:33:11.8536497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8536596Z op = silu_mul_quant 2025-05-07T20:33:11.8536679Z if compiled: 2025-05-07T20:33:11.8536841Z op = torch.compile(op) 2025-05-07T20:33:11.8536948Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8537018Z 2025-05-07T20:33:11.8537117Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8537239Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8537391Z 2025-05-07T20:33:11.8537539Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8537644Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8537743Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8537910Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8538052Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8538135Z 2025-05-07T20:33:11.8538235Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8538240Z 2025-05-07T20:33:11.8538339Z moe/activation_test.py:126: 2025-05-07T20:33:11.8538479Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8538591Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8538727Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8539394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8539499Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8539886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8540122Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8540512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8540789Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8541186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8541371Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8541736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8541821Z fn() 2025-05-07T20:33:11.8542254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8542340Z self.fn.run( 2025-05-07T20:33:11.8542700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8542804Z kernel = self.compile( 2025-05-07T20:33:11.8543207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8543399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8543534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8543539Z 2025-05-07T20:33:11.8543748Z self = 2025-05-07T20:33:11.8544574Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8545092Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f985736aca0>} 2025-05-07T20:33:11.8545898Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8546100Z context = 2025-05-07T20:33:11.8546150Z 2025-05-07T20:33:11.8546321Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8546609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8546791Z module_map=module_map) 2025-05-07T20:33:11.8546966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8547072Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8547155Z E ^ 2025-05-07T20:33:11.8547535Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
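This scale_ub=None, compiled=True example is the one case above that made it through fn(): the failure moved into the test's own reference path, because triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) inside the autotuner's do_bench loop and hits the same fp8e4nv limit there. For orientation, a rough pure-PyTorch sketch of row-wise FP8 quantization that is consistent with the dequantization the test performs (y = y_fp8.to(torch.float32) * y_scale[:, None]); the eps and clamping details are assumptions, not FBGEMM's exact kernel:

    import torch

    def quantize_fp8_row_sketch(x: torch.Tensor, scale_ub: torch.Tensor | None = None):
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = x.abs().amax(dim=-1).float()          # per-row absolute maximum
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # optional upper bound on the scale
        scale = row_max.clamp(min=1e-12) / fp8_max      # per-row dequantization scale
        xq = (x.float() / scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return xq, scale

The scale is defined so that dequantization is a single multiply per row, matching the check the test performs on y_fp8 and y_scale.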
[... seven more examples failed with the identical fp8e4nv CompilationError while compiling _fbgemm_silu_mul_quant; test source listings and tracebacks elided:
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) ...]
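The failure should reproduce without Hypothesis or pytest; a standalone sketch mirroring one failing example (the import path is read off the traceback above, and passing None for scale_ub follows the test code; both are assumptions, not verified against FBGEMM's documented API):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn(T, 2 * D, device="cuda", dtype=torch.bfloat16)
    # On a GPU without fp8e4nv support this raises triton.compiler.errors.CompilationError:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)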
2025-05-07T20:33:11.8644848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8644951Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8645331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8645578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8645939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8646037Z kernel = self.compile( 2025-05-07T20:33:11.8646458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8646694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8646836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8646840Z 2025-05-07T20:33:11.8647138Z self = 2025-05-07T20:33:11.8647956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8648532Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994946ae80>} 2025-05-07T20:33:11.8649328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8649537Z context = 2025-05-07T20:33:11.8649542Z 2025-05-07T20:33:11.8649715Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8650004Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8650119Z module_map=module_map) 2025-05-07T20:33:11.8650286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8650405Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8650489Z E ^ 2025-05-07T20:33:11.8650861Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8650866Z 2025-05-07T20:33:11.8651316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8651323Z 2025-05-07T20:33:11.8651430Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8651670Z self=, 2025-05-07T20:33:11.8651753Z T=4096, 2025-05-07T20:33:11.8651834Z D=5120, 2025-05-07T20:33:11.8651929Z scale_ub=None, 2025-05-07T20:33:11.8652027Z contiguous=False, 2025-05-07T20:33:11.8652112Z compiled=True, 2025-05-07T20:33:11.8652196Z ) 2025-05-07T20:33:11.8652422Z self = 2025-05-07T20:33:11.8652607Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8652620Z 2025-05-07T20:33:11.8652701Z @given( 2025-05-07T20:33:11.8652823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8652933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8653051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8653173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8653303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8653380Z ) 2025-05-07T20:33:11.8653634Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8653741Z def test_silu_mul_quant( 2025-05-07T20:33:11.8653826Z self, 2025-05-07T20:33:11.8653907Z T: int, 2025-05-07T20:33:11.8653993Z D: int, 2025-05-07T20:33:11.8654094Z scale_ub: Optional[float], 2025-05-07T20:33:11.8654195Z contiguous: bool, 2025-05-07T20:33:11.8654288Z compiled: bool, 2025-05-07T20:33:11.8654554Z ) -> None: 2025-05-07T20:33:11.8654659Z torch.manual_seed(2025) 2025-05-07T20:33:11.8654736Z 2025-05-07T20:33:11.8654909Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8654996Z 2025-05-07T20:33:11.8655091Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8655222Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8655378Z x = x_sign * x_clamp 2025-05-07T20:33:11.8655464Z x0 = x[:, :D] 2025-05-07T20:33:11.8655548Z x1 = x[:, D:] 2025-05-07T20:33:11.8655632Z 2025-05-07T20:33:11.8655719Z if contiguous: 2025-05-07T20:33:11.8655824Z x0 = x0.contiguous() 2025-05-07T20:33:11.8655999Z x1 = x1.contiguous() 2025-05-07T20:33:11.8656077Z 2025-05-07T20:33:11.8656180Z if scale_ub is not None: 2025-05-07T20:33:11.8656288Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8656426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8656555Z ) 2025-05-07T20:33:11.8656636Z else: 2025-05-07T20:33:11.8656737Z scale_ub_tensor = None 2025-05-07T20:33:11.8656824Z 2025-05-07T20:33:11.8656958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8657053Z op = silu_mul_quant 2025-05-07T20:33:11.8657149Z if compiled: 2025-05-07T20:33:11.8657256Z op = torch.compile(op) 2025-05-07T20:33:11.8657368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8657453Z 2025-05-07T20:33:11.8657550Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8657554Z 2025-05-07T20:33:11.8657662Z moe/activation_test.py:117: 2025-05-07T20:33:11.8657804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8657911Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8658024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8658416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8658515Z return fn(*args, **kwargs) 
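[Note: every failure in this section is the same environmental issue, not a kernel bug. fp8e4nv is Triton's name for the e4m3 FP8 format (PyTorch's torch.float8_e4m3fn), and Triton's NVIDIA backend only lowers it on GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError reports. Below is a minimal sketch of a capability gate that would skip these cases instead of erroring inside the Triton compiler; the helper name has_fp8e4nv_support, the class name, and the skip wiring are hypothetical, not FBGEMM's actual gating.]

import unittest

import torch


def has_fp8e4nv_support() -> bool:
    """True when Triton can lower fp8e4nv (torch.float8_e4m3fn) on this GPU."""
    # Triton's NVIDIA backend requires compute capability >= 8.9 (Ada/Hopper)
    # for fp8e4nv; the A10G (SM 8.6) on this runner fails the check, matching
    # the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not has_fp8e4nv_support(), "fp8e4nv requires SM >= 8.9")
class Fp8ActivationTests(unittest.TestCase):
    # Stand-in for the real test class; the decorator is the point: on
    # pre-SM-8.9 runners the case is reported as skipped, not failed.
    def test_silu_mul_quant(self) -> None:
        ...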
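[Note: for debugging outside Hypothesis, the failure reproduces with a single direct call. This sketch assumes only what the traceback already shows — the silu_mul_quant import path and its (x0, x1, scale_ub) calling convention — and pins the smallest example tried below (T=1, D=7168); it is a hypothetical standalone script, not part of the test suite.]

import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Fixed parameters instead of Hypothesis-driven ones.
T, D = 1, 7168
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0 = x[:, :D].contiguous()
x1 = x[:, D:].contiguous()

# On SM < 8.9 this raises the CompilationError above as soon as Triton
# compiles _fbgemm_silu_mul_quant; on SM >= 8.9 it returns the quantized
# fp8 tensor and its scale.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
print(y_fp8.dtype, y_scale.shape)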
[Hypothesis then tried eleven more parameter combinations. Each run re-printed the same test body and the same Triton traceback shown above and failed with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The duplicated tracebacks are elided; the examples tried were:]

2025-05-07T20:33:11.8651430Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:11.8665711Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:11.8679281Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:11.8693424Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:11.8706838Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:11.8720319Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:11.8734802Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:11.8747705Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:11.8767637Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:11.8781657Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

[The last example of this chunk is still in flight when the log continues below; its test body, identical to the one shown above, is elided:]

2025-05-07T20:33:11.8795081Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:11.8795322Z     self=,
2025-05-07T20:33:11.8795411Z     T=16384,
2025-05-07T20:33:11.8795494Z     D=5120,
2025-05-07T20:33:11.8795591Z     scale_ub=1200.0,
2025-05-07T20:33:11.8795680Z     contiguous=True,
2025-05-07T20:33:11.8795766Z     compiled=True,
2025-05-07T20:33:11.8795855Z )
2025-05-07T20:33:11.8796083Z self = 
2025-05-07T20:33:11.8796277Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:11.8801150Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:11.8801158Z 
2025-05-07T20:33:11.8801268Z moe/activation_test.py:117: 
2025-05-07T20:33:11.8801404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:11.8801513Z moe/activation_test.py:115: in fn
2025-05-07T20:33:11.8801630Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:11.8802018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:11.8802128Z     return fn(*args, **kwargs)
2025-05-07T20:33:11.8802650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8802755Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8803136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8803367Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8803729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8803837Z kernel = self.compile( 2025-05-07T20:33:11.8804241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8804433Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8804566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8804570Z 2025-05-07T20:33:11.8804781Z self = 2025-05-07T20:33:11.8805602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8806118Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9943fc6ca0>} 2025-05-07T20:33:11.8806918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8807171Z context = 2025-05-07T20:33:11.8807176Z 2025-05-07T20:33:11.8807357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8807633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8807823Z module_map=module_map) 2025-05-07T20:33:11.8808001Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8808108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8808193Z E ^ 2025-05-07T20:33:11.8808618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8808622Z 2025-05-07T20:33:11.8809058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8809062Z 2025-05-07T20:33:11.8809180Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8809418Z self=, 2025-05-07T20:33:11.8809501Z T=16384, 2025-05-07T20:33:11.8809591Z D=5120, 2025-05-07T20:33:11.8809677Z scale_ub=None, 2025-05-07T20:33:11.8809768Z contiguous=False, 2025-05-07T20:33:11.8809861Z compiled=True, 2025-05-07T20:33:11.8809943Z ) 2025-05-07T20:33:11.8810171Z self = 2025-05-07T20:33:11.8810361Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8810369Z 2025-05-07T20:33:11.8810450Z @given( 2025-05-07T20:33:11.8810580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8810683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8810802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8810930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8811048Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8811128Z ) 2025-05-07T20:33:11.8811391Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8811491Z def test_silu_mul_quant( 2025-05-07T20:33:11.8811572Z self, 2025-05-07T20:33:11.8811661Z T: int, 2025-05-07T20:33:11.8811746Z D: int, 2025-05-07T20:33:11.8811857Z scale_ub: Optional[float], 2025-05-07T20:33:11.8811951Z contiguous: bool, 2025-05-07T20:33:11.8812040Z compiled: bool, 2025-05-07T20:33:11.8812134Z ) -> None: 2025-05-07T20:33:11.8812234Z torch.manual_seed(2025) 2025-05-07T20:33:11.8812313Z 2025-05-07T20:33:11.8812495Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8812574Z 2025-05-07T20:33:11.8812673Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8812809Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8812904Z x = x_sign * x_clamp 2025-05-07T20:33:11.8812992Z x0 = x[:, :D] 2025-05-07T20:33:11.8813088Z x1 = x[:, D:] 2025-05-07T20:33:11.8813165Z 2025-05-07T20:33:11.8813256Z if contiguous: 2025-05-07T20:33:11.8813361Z x0 = x0.contiguous() 2025-05-07T20:33:11.8813453Z x1 = x1.contiguous() 2025-05-07T20:33:11.8813538Z 2025-05-07T20:33:11.8813639Z if scale_ub is not None: 2025-05-07T20:33:11.8813750Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8813896Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8813976Z ) 2025-05-07T20:33:11.8814059Z else: 2025-05-07T20:33:11.8814165Z scale_ub_tensor = None 2025-05-07T20:33:11.8814241Z 2025-05-07T20:33:11.8814527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8814634Z op = silu_mul_quant 2025-05-07T20:33:11.8814722Z if compiled: 2025-05-07T20:33:11.8814824Z op = torch.compile(op) 2025-05-07T20:33:11.8814993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8815068Z 2025-05-07T20:33:11.8815171Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8815175Z 2025-05-07T20:33:11.8815276Z moe/activation_test.py:117: 2025-05-07T20:33:11.8815410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8815608Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8815714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8816104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8816251Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8816776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8816885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8817265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8817501Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8817869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8817968Z kernel = self.compile( 2025-05-07T20:33:11.8818377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8818566Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8818706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8818710Z 2025-05-07T20:33:11.8818926Z self = 2025-05-07T20:33:11.8819735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8820261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857b74b80>} 2025-05-07T20:33:11.8821056Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8821255Z context = 2025-05-07T20:33:11.8821262Z 2025-05-07T20:33:11.8821441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8821714Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8821831Z module_map=module_map) 2025-05-07T20:33:11.8821995Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8822102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8822191Z E ^ 2025-05-07T20:33:11.8822562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8822567Z 2025-05-07T20:33:11.8823010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8823021Z 2025-05-07T20:33:11.8823134Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8823362Z self=, 2025-05-07T20:33:11.8823455Z T=2048, 2025-05-07T20:33:11.8823536Z D=5120, 2025-05-07T20:33:11.8823619Z scale_ub=None, 2025-05-07T20:33:11.8823721Z contiguous=False, 2025-05-07T20:33:11.8823809Z compiled=True, 2025-05-07T20:33:11.8823886Z ) 2025-05-07T20:33:11.8824119Z self = 2025-05-07T20:33:11.8824349Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8824354Z 2025-05-07T20:33:11.8824444Z @given( 2025-05-07T20:33:11.8824565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8824673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8824880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8825002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8825120Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8825207Z ) 2025-05-07T20:33:11.8826381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8826749Z def test_silu_mul_quant( 2025-05-07T20:33:11.8826842Z self, 2025-05-07T20:33:11.8826928Z T: int, 2025-05-07T20:33:11.8827011Z D: int, 2025-05-07T20:33:11.8827131Z scale_ub: Optional[float], 2025-05-07T20:33:11.8827228Z contiguous: bool, 2025-05-07T20:33:11.8827354Z compiled: bool, 2025-05-07T20:33:11.8827451Z ) -> None: 2025-05-07T20:33:11.8827553Z torch.manual_seed(2025) 2025-05-07T20:33:11.8827630Z 2025-05-07T20:33:11.8827829Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8827907Z 2025-05-07T20:33:11.8828011Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8828146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8828254Z x = x_sign * x_clamp 2025-05-07T20:33:11.8828335Z x0 = x[:, :D] 2025-05-07T20:33:11.8828421Z x1 = x[:, D:] 2025-05-07T20:33:11.8828504Z 2025-05-07T20:33:11.8828588Z if contiguous: 2025-05-07T20:33:11.8828682Z x0 = x0.contiguous() 2025-05-07T20:33:11.8828779Z x1 = x1.contiguous() 2025-05-07T20:33:11.8828852Z 2025-05-07T20:33:11.8828942Z if scale_ub is not None: 2025-05-07T20:33:11.8829057Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8829197Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8829275Z ) 2025-05-07T20:33:11.8829360Z else: 2025-05-07T20:33:11.8829455Z scale_ub_tensor = None 2025-05-07T20:33:11.8829525Z 2025-05-07T20:33:11.8829669Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8829766Z op = silu_mul_quant 2025-05-07T20:33:11.8829856Z if compiled: 2025-05-07T20:33:11.8829959Z op = torch.compile(op) 2025-05-07T20:33:11.8830065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8830151Z 2025-05-07T20:33:11.8830242Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8830248Z 2025-05-07T20:33:11.8830346Z moe/activation_test.py:117: 2025-05-07T20:33:11.8830490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8830592Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8830697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8831102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8831198Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8831727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8831828Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8832204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8832443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8832800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8832902Z kernel = self.compile( 2025-05-07T20:33:11.8833304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8833789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8833928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8833934Z 2025-05-07T20:33:11.8834139Z self = 2025-05-07T20:33:11.8835111Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8835707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857b760c0>} 2025-05-07T20:33:11.8836504Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8836709Z context = 2025-05-07T20:33:11.8836714Z 2025-05-07T20:33:11.8836882Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8837166Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8837272Z module_map=module_map) 2025-05-07T20:33:11.8837436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8837542Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8837622Z E ^ 2025-05-07T20:33:11.8837995Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8838007Z 2025-05-07T20:33:11.8838444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8838450Z 2025-05-07T20:33:11.8838556Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8838792Z self=, 2025-05-07T20:33:11.8838875Z T=2048, 2025-05-07T20:33:11.8838953Z D=5120, 2025-05-07T20:33:11.8839044Z scale_ub=1200.0, 2025-05-07T20:33:11.8839137Z contiguous=False, 2025-05-07T20:33:11.8839223Z compiled=True, 2025-05-07T20:33:11.8839323Z ) 2025-05-07T20:33:11.8839640Z self = 2025-05-07T20:33:11.8839865Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:11.8839874Z 2025-05-07T20:33:11.8839953Z @given( 2025-05-07T20:33:11.8840076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8840187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8840304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8840423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8840549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8840630Z ) 2025-05-07T20:33:11.8840885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8840989Z def test_silu_mul_quant( 2025-05-07T20:33:11.8841068Z self, 2025-05-07T20:33:11.8841160Z T: int, 2025-05-07T20:33:11.8841241Z D: int, 2025-05-07T20:33:11.8841344Z scale_ub: Optional[float], 2025-05-07T20:33:11.8841443Z contiguous: bool, 2025-05-07T20:33:11.8841531Z compiled: bool, 2025-05-07T20:33:11.8841613Z ) -> None: 2025-05-07T20:33:11.8841718Z torch.manual_seed(2025) 2025-05-07T20:33:11.8841793Z 2025-05-07T20:33:11.8841966Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8842054Z 2025-05-07T20:33:11.8842151Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8842277Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8842437Z x = x_sign * x_clamp 2025-05-07T20:33:11.8842519Z x0 = x[:, :D] 2025-05-07T20:33:11.8842609Z x1 = x[:, D:] 2025-05-07T20:33:11.8842684Z 2025-05-07T20:33:11.8842773Z if contiguous: 2025-05-07T20:33:11.8842873Z x0 = x0.contiguous() 2025-05-07T20:33:11.8843042Z x1 = x1.contiguous() 2025-05-07T20:33:11.8843120Z 2025-05-07T20:33:11.8843224Z if scale_ub is not None: 2025-05-07T20:33:11.8843332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8843469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8843601Z ) 2025-05-07T20:33:11.8843679Z else: 2025-05-07T20:33:11.8843776Z scale_ub_tensor = None 2025-05-07T20:33:11.8843858Z 2025-05-07T20:33:11.8843989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8844078Z op = silu_mul_quant 2025-05-07T20:33:11.8844168Z if compiled: 2025-05-07T20:33:11.8844271Z op = torch.compile(op) 2025-05-07T20:33:11.8844385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8844462Z 2025-05-07T20:33:11.8844554Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8844558Z 2025-05-07T20:33:11.8844660Z moe/activation_test.py:117: 2025-05-07T20:33:11.8844796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8844897Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8845009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8845394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8845496Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8846016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8846114Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8846494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8846725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8847086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8847188Z kernel = self.compile( 2025-05-07T20:33:11.8847590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8847772Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8847905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8847910Z 2025-05-07T20:33:11.8848117Z self = 2025-05-07T20:33:11.8848937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8849451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857b772e0>} 2025-05-07T20:33:11.8850256Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8850450Z context = 2025-05-07T20:33:11.8850455Z 2025-05-07T20:33:11.8850630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8850902Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8851008Z module_map=module_map) 2025-05-07T20:33:11.8851227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8851326Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8851401Z E ^ 2025-05-07T20:33:11.8851779Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8851861Z 2025-05-07T20:33:11.8852300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8852305Z 2025-05-07T20:33:11.8852417Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8852684Z self=, 2025-05-07T20:33:11.8852765Z T=4096, 2025-05-07T20:33:11.8852851Z D=5120, 2025-05-07T20:33:11.8852940Z scale_ub=1200.0, 2025-05-07T20:33:11.8853026Z contiguous=True, 2025-05-07T20:33:11.8853121Z compiled=True, 2025-05-07T20:33:11.8853198Z ) 2025-05-07T20:33:11.8853426Z self = 2025-05-07T20:33:11.8853614Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.8853619Z 2025-05-07T20:33:11.8853698Z @given( 2025-05-07T20:33:11.8853824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8853934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8854053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8854181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8854296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8854482Z ) 2025-05-07T20:33:11.8854746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8854843Z def test_silu_mul_quant( 2025-05-07T20:33:11.8854924Z self, 2025-05-07T20:33:11.8855014Z T: int, 2025-05-07T20:33:11.8855093Z D: int, 2025-05-07T20:33:11.8855206Z scale_ub: Optional[float], 2025-05-07T20:33:11.8855303Z contiguous: bool, 2025-05-07T20:33:11.8855392Z compiled: bool, 2025-05-07T20:33:11.8855478Z ) -> None: 2025-05-07T20:33:11.8855572Z torch.manual_seed(2025) 2025-05-07T20:33:11.8855644Z 2025-05-07T20:33:11.8855834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8855910Z 2025-05-07T20:33:11.8856006Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8856142Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8856230Z x = x_sign * x_clamp 2025-05-07T20:33:11.8856317Z x0 = x[:, :D] 2025-05-07T20:33:11.8856404Z x1 = x[:, D:] 2025-05-07T20:33:11.8856477Z 2025-05-07T20:33:11.8856572Z if contiguous: 2025-05-07T20:33:11.8856664Z x0 = x0.contiguous() 2025-05-07T20:33:11.8856753Z x1 = x1.contiguous() 2025-05-07T20:33:11.8856836Z 2025-05-07T20:33:11.8856928Z if scale_ub is not None: 2025-05-07T20:33:11.8857038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8857184Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8857262Z ) 2025-05-07T20:33:11.8857341Z else: 2025-05-07T20:33:11.8857446Z scale_ub_tensor = None 2025-05-07T20:33:11.8857521Z 2025-05-07T20:33:11.8857657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8857759Z op = silu_mul_quant 2025-05-07T20:33:11.8857847Z if compiled: 2025-05-07T20:33:11.8857948Z op = torch.compile(op) 2025-05-07T20:33:11.8858067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8858143Z 2025-05-07T20:33:11.8858242Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8858247Z 2025-05-07T20:33:11.8858343Z moe/activation_test.py:117: 2025-05-07T20:33:11.8858478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8858588Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8858746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8859137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8859242Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8859933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8860046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8860424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8860694Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8861060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8861158Z kernel = self.compile( 2025-05-07T20:33:11.8861564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8861758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8861892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8861896Z 2025-05-07T20:33:11.8862122Z self = 2025-05-07T20:33:11.8862935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8863459Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98570dc860>} 2025-05-07T20:33:11.8864252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8864451Z context = 2025-05-07T20:33:11.8864455Z 2025-05-07T20:33:11.8864634Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8864914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8865032Z module_map=module_map) 2025-05-07T20:33:11.8865197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8865302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8865384Z E ^ 2025-05-07T20:33:11.8865757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8865762Z 2025-05-07T20:33:11.8866201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8866214Z 2025-05-07T20:33:11.8866323Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8866559Z self=, 2025-05-07T20:33:11.8866651Z T=128, 2025-05-07T20:33:11.8866735Z D=5120, 2025-05-07T20:33:11.8866832Z scale_ub=1200.0, 2025-05-07T20:33:11.8866933Z contiguous=False, 2025-05-07T20:33:11.8867020Z compiled=True, 2025-05-07T20:33:11.8867101Z ) 2025-05-07T20:33:11.8867336Z self = 2025-05-07T20:33:11.8867518Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:11.8867522Z 2025-05-07T20:33:11.8867610Z @given( 2025-05-07T20:33:11.8867732Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8867839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8867967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8868136Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8868255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8868341Z ) 2025-05-07T20:33:11.8868597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8868768Z def test_silu_mul_quant( 2025-05-07T20:33:11.8868860Z self, 2025-05-07T20:33:11.8868943Z T: int, 2025-05-07T20:33:11.8869026Z D: int, 2025-05-07T20:33:11.8869138Z scale_ub: Optional[float], 2025-05-07T20:33:11.8869231Z contiguous: bool, 2025-05-07T20:33:11.8869370Z compiled: bool, 2025-05-07T20:33:11.8869450Z ) -> None: 2025-05-07T20:33:11.8869547Z torch.manual_seed(2025) 2025-05-07T20:33:11.8869630Z 2025-05-07T20:33:11.8869802Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8869879Z 2025-05-07T20:33:11.8869979Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8870110Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8870201Z x = x_sign * x_clamp 2025-05-07T20:33:11.8870291Z x0 = x[:, :D] 2025-05-07T20:33:11.8870370Z x1 = x[:, D:] 2025-05-07T20:33:11.8870445Z 2025-05-07T20:33:11.8870535Z if contiguous: 2025-05-07T20:33:11.8870633Z x0 = x0.contiguous() 2025-05-07T20:33:11.8870731Z x1 = x1.contiguous() 2025-05-07T20:33:11.8870804Z 2025-05-07T20:33:11.8870895Z if scale_ub is not None: 2025-05-07T20:33:11.8871009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8871152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8871231Z ) 2025-05-07T20:33:11.8871318Z else: 2025-05-07T20:33:11.8871417Z scale_ub_tensor = None 2025-05-07T20:33:11.8871487Z 2025-05-07T20:33:11.8871625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8871718Z op = silu_mul_quant 2025-05-07T20:33:11.8871806Z if compiled: 2025-05-07T20:33:11.8871914Z op = torch.compile(op) 2025-05-07T20:33:11.8872022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8872100Z 2025-05-07T20:33:11.8872198Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8872202Z 2025-05-07T20:33:11.8872308Z moe/activation_test.py:117: 2025-05-07T20:33:11.8872450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8872553Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8872657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8873060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8873155Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8873678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8873789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8874168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8874405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8874768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8874865Z kernel = self.compile( 2025-05-07T20:33:11.8875277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8875459Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8875603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8875608Z 2025-05-07T20:33:11.8875818Z self = 2025-05-07T20:33:11.8876629Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8877281Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98570dd580>} 2025-05-07T20:33:11.8878079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8878324Z context = 2025-05-07T20:33:11.8878328Z 2025-05-07T20:33:11.8878502Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8878783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8878905Z module_map=module_map) 2025-05-07T20:33:11.8879081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8879194Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8879278Z E ^ 2025-05-07T20:33:11.8879660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8879665Z 2025-05-07T20:33:11.8880110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8880118Z 2025-05-07T20:33:11.8880228Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8880469Z self=, 2025-05-07T20:33:11.8880554Z T=16384, 2025-05-07T20:33:11.8880637Z D=7168, 2025-05-07T20:33:11.8880731Z scale_ub=1200.0, 2025-05-07T20:33:11.8880819Z contiguous=True, 2025-05-07T20:33:11.8880908Z compiled=True, 2025-05-07T20:33:11.8880999Z ) 2025-05-07T20:33:11.8881228Z self = 2025-05-07T20:33:11.8881417Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.8881422Z 2025-05-07T20:33:11.8881514Z @given( 2025-05-07T20:33:11.8881638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8881747Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8881867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8881985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8882113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8882187Z ) 2025-05-07T20:33:11.8882441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8882543Z def test_silu_mul_quant( 2025-05-07T20:33:11.8882620Z self, 2025-05-07T20:33:11.8882703Z T: int, 2025-05-07T20:33:11.8882793Z D: int, 2025-05-07T20:33:11.8882895Z scale_ub: Optional[float], 2025-05-07T20:33:11.8882989Z contiguous: bool, 2025-05-07T20:33:11.8883083Z compiled: bool, 2025-05-07T20:33:11.8883164Z ) -> None: 2025-05-07T20:33:11.8883265Z torch.manual_seed(2025) 2025-05-07T20:33:11.8883337Z 2025-05-07T20:33:11.8883513Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8883595Z 2025-05-07T20:33:11.8883690Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8883814Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8883912Z x = x_sign * x_clamp 2025-05-07T20:33:11.8883993Z x0 = x[:, :D] 2025-05-07T20:33:11.8884074Z x1 = x[:, D:] 2025-05-07T20:33:11.8884157Z 2025-05-07T20:33:11.8884240Z if contiguous: 2025-05-07T20:33:11.8884329Z x0 = x0.contiguous() 2025-05-07T20:33:11.8884425Z x1 = x1.contiguous() 2025-05-07T20:33:11.8884499Z 2025-05-07T20:33:11.8884640Z if scale_ub is not None: 2025-05-07T20:33:11.8884758Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8884894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8884978Z ) 2025-05-07T20:33:11.8885057Z else: 2025-05-07T20:33:11.8885231Z scale_ub_tensor = None 2025-05-07T20:33:11.8885313Z 2025-05-07T20:33:11.8885447Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8885539Z op = silu_mul_quant 2025-05-07T20:33:11.8885631Z if compiled: 2025-05-07T20:33:11.8885772Z op = torch.compile(op) 2025-05-07T20:33:11.8885878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8885960Z 2025-05-07T20:33:11.8886056Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8886061Z 2025-05-07T20:33:11.8886166Z moe/activation_test.py:117: 2025-05-07T20:33:11.8886298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8886403Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8886510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8886903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8886998Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8887531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8887630Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8888017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8888246Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8888604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8888713Z kernel = self.compile( 2025-05-07T20:33:11.8889119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8889298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8889440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8889444Z 2025-05-07T20:33:11.8889682Z self = 2025-05-07T20:33:11.8890518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8891035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98570de0c0>} 2025-05-07T20:33:11.8891835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8892031Z context = 2025-05-07T20:33:11.8892035Z 2025-05-07T20:33:11.8892212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8892492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8892600Z module_map=module_map) 2025-05-07T20:33:11.8892781Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8892881Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8892962Z E ^ 2025-05-07T20:33:11.8899861Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8899870Z 2025-05-07T20:33:11.8900340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8900418Z 2025-05-07T20:33:11.8900530Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8900775Z self=, 2025-05-07T20:33:11.8900963Z T=16384, 2025-05-07T20:33:11.8901051Z D=5120, 2025-05-07T20:33:11.8901152Z scale_ub=1200.0, 2025-05-07T20:33:11.8901243Z contiguous=True, 2025-05-07T20:33:11.8901334Z compiled=False, 2025-05-07T20:33:11.8901425Z ) 2025-05-07T20:33:11.8901698Z self = 2025-05-07T20:33:11.8901895Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.8901900Z 2025-05-07T20:33:11.8901986Z @given( 2025-05-07T20:33:11.8902109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8902228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8902352Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8902477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8902602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8902684Z ) 2025-05-07T20:33:11.8902947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8903055Z def test_silu_mul_quant( 2025-05-07T20:33:11.8903138Z self, 2025-05-07T20:33:11.8903230Z T: int, 2025-05-07T20:33:11.8903316Z D: int, 2025-05-07T20:33:11.8903419Z scale_ub: Optional[float], 2025-05-07T20:33:11.8903522Z contiguous: bool, 2025-05-07T20:33:11.8903613Z compiled: bool, 2025-05-07T20:33:11.8903696Z ) -> None: 2025-05-07T20:33:11.8903809Z torch.manual_seed(2025) 2025-05-07T20:33:11.8903891Z 2025-05-07T20:33:11.8904069Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8904161Z 2025-05-07T20:33:11.8904265Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8904396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8904498Z x = x_sign * x_clamp 2025-05-07T20:33:11.8904587Z x0 = x[:, :D] 2025-05-07T20:33:11.8904674Z x1 = x[:, D:] 2025-05-07T20:33:11.8904761Z 2025-05-07T20:33:11.8904857Z if contiguous: 2025-05-07T20:33:11.8904964Z x0 = x0.contiguous() 2025-05-07T20:33:11.8905061Z x1 = x1.contiguous() 2025-05-07T20:33:11.8905141Z 2025-05-07T20:33:11.8905245Z if scale_ub is not None: 2025-05-07T20:33:11.8905360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8905500Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8905588Z ) 2025-05-07T20:33:11.8905670Z else: 2025-05-07T20:33:11.8905771Z scale_ub_tensor = None 2025-05-07T20:33:11.8905859Z 2025-05-07T20:33:11.8905992Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8906089Z op = silu_mul_quant 2025-05-07T20:33:11.8906186Z if compiled: 2025-05-07T20:33:11.8906289Z op = torch.compile(op) 2025-05-07T20:33:11.8906405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8906482Z 2025-05-07T20:33:11.8906582Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8906586Z 2025-05-07T20:33:11.8906693Z moe/activation_test.py:117: 2025-05-07T20:33:11.8906824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8906928Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8907039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8907564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:11.8907674Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8908052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8908334Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8908701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8908876Z kernel = self.compile( 2025-05-07T20:33:11.8909286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8909478Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8909652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8909656Z 2025-05-07T20:33:11.8909878Z self = 2025-05-07T20:33:11.8910693Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8911216Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98570df1a0>} 2025-05-07T20:33:11.8914634Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8914849Z context = 2025-05-07T20:33:11.8914858Z 2025-05-07T20:33:11.8915037Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8915312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8915424Z module_map=module_map) 2025-05-07T20:33:11.8915597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8915703Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8915781Z E ^ 2025-05-07T20:33:11.8916166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8916171Z 2025-05-07T20:33:11.8916616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8916620Z 2025-05-07T20:33:11.8916755Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8916988Z self=, 2025-05-07T20:33:11.8917074Z T=1, 2025-05-07T20:33:11.8917165Z D=7168, 2025-05-07T20:33:11.8917250Z scale_ub=1200.0, 2025-05-07T20:33:11.8917336Z contiguous=False, 2025-05-07T20:33:11.8917432Z compiled=False, 2025-05-07T20:33:11.8917508Z ) 2025-05-07T20:33:11.8917735Z self = 2025-05-07T20:33:11.8917921Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:11.8917926Z 2025-05-07T20:33:11.8918007Z @given( 2025-05-07T20:33:11.8918138Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8918241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8918364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8918491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8918608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8918690Z ) 2025-05-07T20:33:11.8918955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8919051Z def test_silu_mul_quant( 2025-05-07T20:33:11.8919130Z self, 2025-05-07T20:33:11.8919222Z T: int, 2025-05-07T20:33:11.8919301Z D: int, 2025-05-07T20:33:11.8919401Z scale_ub: Optional[float], 2025-05-07T20:33:11.8919499Z contiguous: bool, 2025-05-07T20:33:11.8919650Z compiled: bool, 2025-05-07T20:33:11.8919737Z ) -> None: 2025-05-07T20:33:11.8919841Z torch.manual_seed(2025) 2025-05-07T20:33:11.8919921Z 2025-05-07T20:33:11.8920110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8920193Z 2025-05-07T20:33:11.8920333Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8920474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8920569Z x = x_sign * x_clamp 2025-05-07T20:33:11.8920656Z x0 = x[:, :D] 2025-05-07T20:33:11.8920795Z x1 = x[:, D:] 2025-05-07T20:33:11.8920874Z 2025-05-07T20:33:11.8920966Z if contiguous: 2025-05-07T20:33:11.8921071Z x0 = x0.contiguous() 2025-05-07T20:33:11.8921166Z x1 = x1.contiguous() 2025-05-07T20:33:11.8921249Z 2025-05-07T20:33:11.8921347Z if scale_ub is not None: 2025-05-07T20:33:11.8921463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8921616Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8921697Z ) 2025-05-07T20:33:11.8921783Z else: 2025-05-07T20:33:11.8921890Z scale_ub_tensor = None 2025-05-07T20:33:11.8921971Z 2025-05-07T20:33:11.8922108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8922210Z op = silu_mul_quant 2025-05-07T20:33:11.8922295Z if compiled: 2025-05-07T20:33:11.8922480Z op = torch.compile(op) 2025-05-07T20:33:11.8922598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8922678Z 2025-05-07T20:33:11.8922778Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8922782Z 2025-05-07T20:33:11.8922884Z moe/activation_test.py:117: 2025-05-07T20:33:11.8923017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8923126Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8923227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8923760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8923867Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8924249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8924489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8924853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8924951Z kernel = self.compile( 2025-05-07T20:33:11.8925371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8926342Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8926479Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8926494Z 2025-05-07T20:33:11.8926707Z self = 2025-05-07T20:33:11.8927523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8928052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9856c30680>} 2025-05-07T20:33:11.8928854Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8929060Z context = 2025-05-07T20:33:11.8929064Z 2025-05-07T20:33:11.8929335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8929612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8929733Z module_map=module_map) 2025-05-07T20:33:11.8929966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8930070Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8930163Z E ^ 2025-05-07T20:33:11.8930543Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8930605Z 2025-05-07T20:33:11.8931059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8931064Z 2025-05-07T20:33:11.8931173Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8931408Z self=, 2025-05-07T20:33:11.8931507Z T=4096, 2025-05-07T20:33:11.8931590Z D=7168, 2025-05-07T20:33:11.8931692Z scale_ub=1200.0, 2025-05-07T20:33:11.8931788Z contiguous=False, 2025-05-07T20:33:11.8931877Z compiled=True, 2025-05-07T20:33:11.8931966Z ) 2025-05-07T20:33:11.8932199Z self = 2025-05-07T20:33:11.8932388Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:11.8932393Z 2025-05-07T20:33:11.8932488Z @given( 2025-05-07T20:33:11.8932712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8932823Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8932956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8933079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8933210Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8933293Z ) 2025-05-07T20:33:11.8933553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8933664Z def test_silu_mul_quant( 2025-05-07T20:33:11.8933749Z self, 2025-05-07T20:33:11.8933831Z T: int, 2025-05-07T20:33:11.8933920Z D: int, 2025-05-07T20:33:11.8934025Z scale_ub: Optional[float], 2025-05-07T20:33:11.8934124Z contiguous: bool, 2025-05-07T20:33:11.8934225Z compiled: bool, 2025-05-07T20:33:11.8934314Z ) -> None: 2025-05-07T20:33:11.8934491Z torch.manual_seed(2025) 2025-05-07T20:33:11.8934583Z 2025-05-07T20:33:11.8934759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8934851Z 2025-05-07T20:33:11.8934948Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8935080Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8935183Z x = x_sign * x_clamp 2025-05-07T20:33:11.8935270Z x0 = x[:, :D] 2025-05-07T20:33:11.8935357Z x1 = x[:, D:] 2025-05-07T20:33:11.8935444Z 2025-05-07T20:33:11.8935538Z if contiguous: 2025-05-07T20:33:11.8935638Z x0 = x0.contiguous() 2025-05-07T20:33:11.8935748Z x1 = x1.contiguous() 2025-05-07T20:33:11.8935826Z 2025-05-07T20:33:11.8935926Z if scale_ub is not None: 2025-05-07T20:33:11.8936051Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8936197Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8936277Z ) 2025-05-07T20:33:11.8936368Z else: 2025-05-07T20:33:11.8936472Z scale_ub_tensor = None 2025-05-07T20:33:11.8936559Z 2025-05-07T20:33:11.8936694Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8936790Z op = silu_mul_quant 2025-05-07T20:33:11.8936887Z if compiled: 2025-05-07T20:33:11.8936995Z op = torch.compile(op) 2025-05-07T20:33:11.8937109Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8937197Z 2025-05-07T20:33:11.8937346Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8937351Z 2025-05-07T20:33:11.8937453Z moe/activation_test.py:117: 2025-05-07T20:33:11.8937602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8937709Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8937865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8938265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8938369Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8938906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8939050Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8939432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8939676Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8940044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8940154Z kernel = self.compile( 2025-05-07T20:33:11.8940566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8940750Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8940941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8940945Z 2025-05-07T20:33:11.8941163Z self = 2025-05-07T20:33:11.8941984Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8942501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9856c31940>} 2025-05-07T20:33:11.8943303Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8943512Z context = 2025-05-07T20:33:11.8943517Z 2025-05-07T20:33:11.8943695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8943986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8944098Z module_map=module_map) 2025-05-07T20:33:11.8944265Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8944381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8944464Z E ^ 2025-05-07T20:33:11.8944850Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
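For reference while reading the listing above: judging from the call op(x0, x1, scale_ub_tensor) returning (y_fp8, y_scale), silu_mul_quant fuses a SiLU-and-multiply with FP8 quantization. A plain-PyTorch sketch of that contract, assuming row-wise E4M3 scaling with an optional upper bound on the pre-scale row maximum; the actual FBGEMM kernel's scaling scheme may differ:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # The fused op, unfused: y = silu(x0) * x1, then quantize rows to FP8.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)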
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
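From here on, memory pressure compounds across the Hypothesis loop: the failing line creeps upward from the kernel call to torch.clamp, torch.sign, and finally torch.randn itself as free memory shrinks from 140.44 MiB toward 26.44 MiB, which suggests earlier examples' allocations (or torch.compile caches holding them) are not being released. The allocator message names one mitigation; a sketch of it plus an explicit per-example cleanup, offered as assumptions about hardening the job, not as the repo's actual fix:

# 1) Before the process starts, so it applies when CUDA initializes:
#    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
# 2) Inside the test process, between examples:

import gc

import torch


def release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.synchronize()  # ensure pending kernels have finished
    torch.cuda.empty_cache()  # return cached allocator blocks to the driver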
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:94: OutOfMemoryError
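Any one of these shrunk cases can be pinned as a deterministic regression test with Hypothesis's @example decorator, which replays the given inputs unconditionally before any generated examples. A self-contained sketch; the strategy mirrors the test above, while the function body here is illustrative:

from hypothesis import example, given, settings
from hypothesis import strategies as st


@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=2048)  # replayed first, before randomly generated examples
@settings(deadline=None)
def test_t_is_positive(T: int) -> None:
    assert T > 0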
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.9071119Z 2025-05-07T20:33:11.9071241Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:11.9071245Z 2025-05-07T20:33:11.9071350Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.9071624Z self=, 2025-05-07T20:33:11.9071706Z T=16384, 2025-05-07T20:33:11.9071778Z D=5120, 2025-05-07T20:33:11.9071871Z scale_ub=None, 2025-05-07T20:33:11.9071954Z contiguous=True, 2025-05-07T20:33:11.9072075Z compiled=False, 2025-05-07T20:33:11.9072146Z ) 2025-05-07T20:33:11.9072366Z self = 2025-05-07T20:33:11.9072545Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:11.9072550Z 2025-05-07T20:33:11.9072635Z @given( 2025-05-07T20:33:11.9072754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.9072864Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.9072977Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.9073092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.9073208Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.9073283Z ) 2025-05-07T20:33:11.9073533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.9073626Z def test_silu_mul_quant( 2025-05-07T20:33:11.9073746Z self, 2025-05-07T20:33:11.9073820Z T: int, 2025-05-07T20:33:11.9073899Z D: int, 2025-05-07T20:33:11.9073994Z scale_ub: Optional[float], 2025-05-07T20:33:11.9074082Z contiguous: bool, 2025-05-07T20:33:11.9074168Z compiled: bool, 2025-05-07T20:33:11.9074244Z ) -> None: 2025-05-07T20:33:11.9074336Z torch.manual_seed(2025) 2025-05-07T20:33:11.9074406Z 2025-05-07T20:33:11.9074576Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.9076506Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.9076513Z 2025-05-07T20:33:11.9076629Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.9076633Z 2025-05-07T20:33:11.9076733Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.9076956Z self=, 2025-05-07T20:33:11.9077033Z T=4096, 2025-05-07T20:33:11.9077113Z D=5120, 2025-05-07T20:33:11.9077194Z scale_ub=None, 2025-05-07T20:33:11.9077279Z contiguous=True, 2025-05-07T20:33:11.9077368Z compiled=False, 2025-05-07T20:33:11.9077441Z ) 2025-05-07T20:33:11.9077663Z self = 2025-05-07T20:33:11.9077844Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:11.9077849Z 2025-05-07T20:33:11.9077923Z @given( 2025-05-07T20:33:11.9078041Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.9078136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.9078249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.9078370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.9078478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.9078552Z ) 2025-05-07T20:33:11.9078806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.9078946Z def test_silu_mul_quant( 2025-05-07T20:33:11.9079022Z self, 2025-05-07T20:33:11.9079098Z T: int, 2025-05-07T20:33:11.9079177Z D: int, 2025-05-07T20:33:11.9079278Z scale_ub: Optional[float], 2025-05-07T20:33:11.9079364Z contiguous: bool, 2025-05-07T20:33:11.9079482Z compiled: bool, 2025-05-07T20:33:11.9079565Z ) -> None: 2025-05-07T20:33:11.9079657Z torch.manual_seed(2025) 2025-05-07T20:33:11.9079729Z 2025-05-07T20:33:11.9079906Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.9081855Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.9081864Z 2025-05-07T20:33:11.9081983Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.9081988Z 2025-05-07T20:33:11.9082094Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.9082334Z self=, 2025-05-07T20:33:11.9082414Z T=2048, 2025-05-07T20:33:11.9082531Z D=5120, 2025-05-07T20:33:11.9082623Z scale_ub=None, 2025-05-07T20:33:11.9082709Z contiguous=False, 2025-05-07T20:33:11.9082791Z compiled=False, 2025-05-07T20:33:11.9082871Z ) 2025-05-07T20:33:11.9083091Z self = 2025-05-07T20:33:11.9083267Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.9083272Z 2025-05-07T20:33:11.9083351Z @given( 2025-05-07T20:33:11.9083467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.9083569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.9083681Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.9083797Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.9083917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.9083984Z ) 2025-05-07T20:33:11.9084237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.9084333Z def test_silu_mul_quant( 2025-05-07T20:33:11.9084410Z self, 2025-05-07T20:33:11.9084483Z T: int, 2025-05-07T20:33:11.9084565Z D: int, 2025-05-07T20:33:11.9084661Z scale_ub: Optional[float], 2025-05-07T20:33:11.9084747Z contiguous: bool, 2025-05-07T20:33:11.9084835Z compiled: bool, 2025-05-07T20:33:11.9084912Z ) -> None: 2025-05-07T20:33:11.9085006Z torch.manual_seed(2025) 2025-05-07T20:33:11.9085075Z 2025-05-07T20:33:11.9085246Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.9087158Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.9087166Z 2025-05-07T20:33:11.9087281Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.9087286Z 2025-05-07T20:33:11.9087389Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.9087615Z self=, 2025-05-07T20:33:11.9087735Z T=4096, 2025-05-07T20:33:11.9087818Z D=7168, 2025-05-07T20:33:11.9087907Z scale_ub=None, 2025-05-07T20:33:11.9087993Z contiguous=True, 2025-05-07T20:33:11.9088080Z compiled=True, 2025-05-07T20:33:11.9088154Z ) 2025-05-07T20:33:11.9088411Z self = 2025-05-07T20:33:11.9088593Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.9088597Z 2025-05-07T20:33:11.9088678Z @given( 2025-05-07T20:33:11.9088797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.9088943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.9089059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.9089185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.9089301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.9089373Z ) 2025-05-07T20:33:11.9089627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.9089728Z def test_silu_mul_quant( 2025-05-07T20:33:11.9089808Z self, 2025-05-07T20:33:11.9089887Z T: int, 2025-05-07T20:33:11.9089963Z D: int, 2025-05-07T20:33:11.9090063Z scale_ub: Optional[float], 2025-05-07T20:33:11.9090155Z contiguous: bool, 2025-05-07T20:33:11.9090243Z compiled: bool, 2025-05-07T20:33:11.9090323Z ) -> None: 2025-05-07T20:33:11.9090460Z torch.manual_seed(2025) 2025-05-07T20:33:11.9090539Z 2025-05-07T20:33:11.9090726Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.9092636Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
The next six examples fail identically at moe/activation_test.py:92 in
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
each with GPU 0 reporting 26.44 MiB free of 22.07 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated); only the requested allocation size varies:

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB.

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB.

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB.

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB.

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB.

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB.
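Any of the failing parameter sets above can be pinned so it is always exercised first on the next run; a sketch using Hypothesis's @example decorator (illustrative only; the real decorators sit on ActivationTests.test_silu_mul_quant in moe/activation_test.py):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=2048, D=5120)  # explicit examples run before sampled ones
    @settings(max_examples=10, deadline=None)
    def check_shapes(T: int, D: int) -> None:
        # Stand-in body; the real test allocates [T, 2 * D] bf16 tensors on CUDA.
        assert T * D > 0

    check_shapes()  # Hypothesis-wrapped test functions are directly callable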
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98569cccc0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
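Unlike the OOMs, this CompilationError is deterministic: Triton's fp8e4nv is the FP8 E4M3 format, which requires compute capability 8.9 or newer (Ada/Hopper), and the error text confirms this device only exposes fp8e4b15 and fp8e5. A sketch of a capability guard for such tests (the guard name and its placement are assumptions, not the suite's actual gating):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv" in Triton) needs compute capability >= (8, 9).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test class:
    # @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU")
    # class ActivationTests(unittest.TestCase): ...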
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remainder of the Triton compile stack identical to the previous example)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
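Free memory shrinks monotonically across examples (26.44 MiB free earlier in the run, 4.44 MiB here), so even these 20.00 MiB allocations now fail: tensors from previous examples are evidently still alive. A sketch of releasing cached CUDA blocks between examples (where to wire it in, e.g. TestCase.tearDown or a pytest fixture, is an assumption about the suite's structure):

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first so the caching allocator
        # can actually return freed segments to the driver.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()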
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:33:11.9181811Z 2025-05-07T20:33:11.9182028Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:11.9182200Z ================= 1 failed, 1 deselected, 3 warnings in 15.43s ================= 2025-05-07T20:33:13.6125233Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:13.6775278Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:33:13.6775814Z 2025-05-07T20:33:15.6793614Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:17.8497094Z ============================= test session starts ============================== 2025-05-07T20:33:17.8498354Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:17.8499427Z cachedir: .pytest_cache 2025-05-07T20:33:17.8500532Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:17.8501312Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:17.8501744Z plugins: hypothesis-6.131.14 2025-05-07T20:33:19.4957337Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:19.6046883Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:19.6047321Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:19.6047545Z 2025-05-07T20:33:22.0555152Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.0555830Z self=, 2025-05-07T20:33:22.0556328Z T=1, 2025-05-07T20:33:22.0556537Z D=5120, 2025-05-07T20:33:22.0556754Z scale_ub=None, 2025-05-07T20:33:22.0556976Z contiguous=True, 2025-05-07T20:33:22.0557541Z compiled=True, 2025-05-07T20:33:22.0557756Z ) 2025-05-07T20:33:22.0558079Z self = 2025-05-07T20:33:22.0558584Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.0558858Z 2025-05-07T20:33:22.0558948Z @given( 2025-05-07T20:33:22.0559280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.0559609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.0559942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.0560291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.0560718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.0561015Z ) 2025-05-07T20:33:22.0561373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.0561824Z def test_silu_mul_quant( 2025-05-07T20:33:22.0562073Z self, 2025-05-07T20:33:22.0562274Z T: int, 2025-05-07T20:33:22.0562471Z D: int, 2025-05-07T20:33:22.0562694Z scale_ub: Optional[float], 2025-05-07T20:33:22.0562977Z contiguous: bool, 2025-05-07T20:33:22.0563216Z compiled: bool, 2025-05-07T20:33:22.0563456Z ) -> None: 2025-05-07T20:33:22.0563676Z torch.manual_seed(2025) 2025-05-07T20:33:22.0563922Z 2025-05-07T20:33:22.0564198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.0564561Z 2025-05-07T20:33:22.0564843Z x_sign = torch.sign(x) 2025-05-07T20:33:22.0565135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a057dc60>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:22.8019373Z x1 = x[:, D:] 2025-05-07T20:33:22.8019589Z 2025-05-07T20:33:22.8019780Z if contiguous: 2025-05-07T20:33:22.8020037Z x0 = x0.contiguous() 2025-05-07T20:33:22.8020295Z x1 = x1.contiguous() 2025-05-07T20:33:22.8020547Z 2025-05-07T20:33:22.8020754Z if scale_ub is not None: 2025-05-07T20:33:22.8021032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.8021384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.8021710Z ) 2025-05-07T20:33:22.8021911Z else: 2025-05-07T20:33:22.8022132Z scale_ub_tensor = None 2025-05-07T20:33:22.8022398Z 2025-05-07T20:33:22.8022634Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8022968Z op = silu_mul_quant 2025-05-07T20:33:22.8023236Z if compiled: 2025-05-07T20:33:22.8023485Z op = torch.compile(op) 2025-05-07T20:33:22.8023809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8024104Z 2025-05-07T20:33:22.8024304Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.8024473Z 2025-05-07T20:33:22.8024577Z moe/activation_test.py:117: 2025-05-07T20:33:22.8024887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8025236Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.8025898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8026637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.8027365Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.8027930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.8028648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.8029347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.8029910Z kernel = self.compile( 2025-05-07T20:33:22.8030471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.8031159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.8031566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8031896Z 2025-05-07T20:33:22.8032113Z self = 2025-05-07T20:33:22.8033294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.8034870Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03d4220>} 2025-05-07T20:33:22.8036352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.8037449Z context = 2025-05-07T20:33:22.8037751Z 2025-05-07T20:33:22.8037932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.8038472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.8038966Z module_map=module_map) 2025-05-07T20:33:22.8039352Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.8039729Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.8039996Z E ^ 2025-05-07T20:33:22.8040541Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.8041021Z 2025-05-07T20:33:22.8041466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.8042009Z 2025-05-07T20:33:22.8042126Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.8042550Z self=, 2025-05-07T20:33:22.8042981Z T=2048, 2025-05-07T20:33:22.8043181Z D=5120, 2025-05-07T20:33:22.8043381Z scale_ub=1200.0, 2025-05-07T20:33:22.8043617Z contiguous=True, 2025-05-07T20:33:22.8043851Z compiled=True, 2025-05-07T20:33:22.8044060Z ) 2025-05-07T20:33:22.8044398Z self = 2025-05-07T20:33:22.8044920Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.8045205Z 2025-05-07T20:33:22.8045287Z @given( 2025-05-07T20:33:22.8045532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.8045865Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.8046187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.8046525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.8046870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.8047175Z ) 2025-05-07T20:33:22.8047534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.8048002Z def test_silu_mul_quant( 2025-05-07T20:33:22.8048258Z self, 2025-05-07T20:33:22.8048460Z T: int, 2025-05-07T20:33:22.8048671Z D: int, 2025-05-07T20:33:22.8048894Z scale_ub: Optional[float], 2025-05-07T20:33:22.8049169Z contiguous: bool, 2025-05-07T20:33:22.8049420Z compiled: bool, 2025-05-07T20:33:22.8049645Z ) -> None: 2025-05-07T20:33:22.8049858Z torch.manual_seed(2025) 2025-05-07T20:33:22.8050109Z 2025-05-07T20:33:22.8050391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.8050745Z 2025-05-07T20:33:22.8050952Z x_sign = torch.sign(x) 2025-05-07T20:33:22.8051256Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.8051583Z x = x_sign * x_clamp 2025-05-07T20:33:22.8051828Z x0 = x[:, :D] 2025-05-07T20:33:22.8052055Z x1 = x[:, D:] 2025-05-07T20:33:22.8052269Z 2025-05-07T20:33:22.8052506Z if contiguous: 2025-05-07T20:33:22.8052745Z x0 = x0.contiguous() 2025-05-07T20:33:22.8053010Z x1 = x1.contiguous() 2025-05-07T20:33:22.8053247Z 2025-05-07T20:33:22.8053443Z if scale_ub is not None: 2025-05-07T20:33:22.8053717Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.8054094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.8054509Z ) 2025-05-07T20:33:22.8054712Z else: 2025-05-07T20:33:22.8054920Z scale_ub_tensor = None 2025-05-07T20:33:22.8055222Z 2025-05-07T20:33:22.8055457Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8055773Z op = silu_mul_quant 2025-05-07T20:33:22.8056029Z if compiled: 2025-05-07T20:33:22.8056280Z op = torch.compile(op) 2025-05-07T20:33:22.8056582Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8056859Z 2025-05-07T20:33:22.8057056Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.8057352Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.8057640Z 2025-05-07T20:33:22.8057884Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8058233Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.8058534Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.8058862Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.8059284Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.8059597Z 2025-05-07T20:33:22.8059807Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:22.8060010Z 2025-05-07T20:33:22.8060116Z moe/activation_test.py:126: 2025-05-07T20:33:22.8060424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8060765Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.8061113Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.8061945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.8062734Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.8063307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.8064024Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.8064754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.8065511Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.8066276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.8066948Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.8067574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.8068113Z fn() 2025-05-07T20:33:22.8068647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.8069258Z self.fn.run( 2025-05-07T20:33:22.8069744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.8070304Z kernel = self.compile( 2025-05-07T20:33:22.8070866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.8071547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.8071949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8072188Z 2025-05-07T20:33:22.8072397Z self = 2025-05-07T20:33:22.8073589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.8075060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03d56c0>} 2025-05-07T20:33:22.8086662Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.8087847Z context = 2025-05-07T20:33:22.8088372Z 2025-05-07T20:33:22.8088553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.8089098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.8089594Z module_map=module_map) 2025-05-07T20:33:22.8089978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.8090359Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.8090630Z E ^ 2025-05-07T20:33:22.8091122Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.8091601Z 2025-05-07T20:33:22.8092150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.8092707Z 2025-05-07T20:33:22.8092821Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.8093267Z self=, 2025-05-07T20:33:22.8093703Z T=16384, 2025-05-07T20:33:22.8093913Z D=7168, 2025-05-07T20:33:22.8094114Z scale_ub=1200.0, 2025-05-07T20:33:22.8094448Z contiguous=False, 2025-05-07T20:33:22.8094694Z compiled=False, 2025-05-07T20:33:22.8094908Z ) 2025-05-07T20:33:23.5601066Z self = 2025-05-07T20:33:23.5601626Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:23.5601932Z 2025-05-07T20:33:23.5602036Z @given( 2025-05-07T20:33:23.5602310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.5602650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.5602999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.5603376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.5603716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.5604000Z ) 2025-05-07T20:33:23.5604362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.5604826Z def test_silu_mul_quant( 2025-05-07T20:33:23.5605074Z self, 2025-05-07T20:33:23.5605268Z T: int, 2025-05-07T20:33:23.5605468Z D: int, 2025-05-07T20:33:23.5605690Z scale_ub: Optional[float], 2025-05-07T20:33:23.5605966Z contiguous: bool, 2025-05-07T20:33:23.5606209Z compiled: bool, 2025-05-07T20:33:23.5606438Z ) -> None: 2025-05-07T20:33:23.5606654Z torch.manual_seed(2025) 2025-05-07T20:33:23.5606904Z 2025-05-07T20:33:23.5607190Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.5607576Z 2025-05-07T20:33:23.5607780Z x_sign = torch.sign(x) 2025-05-07T20:33:23.5608072Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.5608393Z x = x_sign * x_clamp 2025-05-07T20:33:23.5608638Z x0 = x[:, :D] 2025-05-07T20:33:23.5608847Z x1 = x[:, D:] 2025-05-07T20:33:23.5609055Z 2025-05-07T20:33:23.5609242Z if contiguous: 2025-05-07T20:33:23.5609473Z x0 = x0.contiguous() 2025-05-07T20:33:23.5609740Z x1 = x1.contiguous() 2025-05-07T20:33:23.5610143Z 2025-05-07T20:33:23.5610329Z if scale_ub is not None: 2025-05-07T20:33:23.5610611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.5610955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.5611267Z ) 2025-05-07T20:33:23.5611553Z else: 2025-05-07T20:33:23.5611777Z scale_ub_tensor = None 2025-05-07T20:33:23.5612030Z 2025-05-07T20:33:23.5612277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.5612645Z op = silu_mul_quant 2025-05-07T20:33:23.5612983Z if compiled: 2025-05-07T20:33:23.5613231Z op = torch.compile(op) 2025-05-07T20:33:23.5613536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.5613827Z 2025-05-07T20:33:23.5614016Z > y_fp8, y_scale = fn() 2025-05-07T20:33:23.5614191Z 2025-05-07T20:33:23.5614292Z moe/activation_test.py:117: 2025-05-07T20:33:23.5614710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.5615045Z moe/activation_test.py:115: in fn 2025-05-07T20:33:23.5615338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.5616072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:23.5616804Z _fbgemm_silu_mul_quant[grid](
[... Triton compile stack: jit.py:330 (<lambda>) -> jit.py:623 (run) -> compiler.py:273 (compile) -> ASTSource.make_ir, with the same CUDAOptions as above ...]
2025-05-07T20:33:23.5628771Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:23.5629145Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:23.5629424Z E ^
2025-05-07T20:33:23.5629918Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:23.5630832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
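All of these failures share one root cause: Triton's fp8e4nv is the FP8 E4M3 format (torch.float8_e4m3fn), and Triton's NVIDIA backend only accepts it on GPUs with compute capability (8, 9) or newer (Ada/Hopper). On this runner's GPU the backend offers only fp8e4b15 and fp8e5, as the ValueError says, so every kernel that touches the FP8 row-quantized dtype dies at compile time regardless of which example parameters Hypothesis draws. A minimal guard sketch, assuming the test keeps its current shape (the helper name and skipUnless wiring below are illustrative, not part of activation_test.py):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ on NVIDIA GPUs.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative usage on the test above:
    # @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")
    # def test_silu_mul_quant(...): ...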
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.5630393Z 2025-05-07T20:33:23.5630832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.5631506Z 2025-05-07T20:33:23.5631612Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.5632039Z self=, 2025-05-07T20:33:23.5632453Z T=1, 2025-05-07T20:33:23.5632648Z D=7168, 2025-05-07T20:33:23.5632847Z scale_ub=None, 2025-05-07T20:33:23.5633128Z contiguous=True, 2025-05-07T20:33:23.5633360Z compiled=True, 2025-05-07T20:33:23.5633574Z ) 2025-05-07T20:33:23.5633908Z self = 2025-05-07T20:33:23.5634406Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:23.5634742Z 2025-05-07T20:33:23.5634822Z @given( 2025-05-07T20:33:23.5635060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.5635379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.5635699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.5636044Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.5636385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.5636691Z ) 2025-05-07T20:33:23.5637053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.5637522Z def test_silu_mul_quant( 2025-05-07T20:33:23.5637767Z self, 2025-05-07T20:33:23.5637979Z T: int, 2025-05-07T20:33:23.5638192Z D: int, 2025-05-07T20:33:23.5638410Z scale_ub: Optional[float], 2025-05-07T20:33:23.5638751Z contiguous: bool, 2025-05-07T20:33:23.5638993Z compiled: bool, 2025-05-07T20:33:23.5639213Z ) -> None: 2025-05-07T20:33:23.5639431Z torch.manual_seed(2025) 2025-05-07T20:33:23.5639676Z 2025-05-07T20:33:23.5639948Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.5640306Z 2025-05-07T20:33:23.5640503Z x_sign = torch.sign(x) 2025-05-07T20:33:23.5640793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.5641114Z x = x_sign * x_clamp 2025-05-07T20:33:23.5641357Z x0 = x[:, :D] 2025-05-07T20:33:23.5641568Z x1 = x[:, D:] 2025-05-07T20:33:23.5641784Z 2025-05-07T20:33:23.5641973Z if contiguous: 2025-05-07T20:33:23.5642201Z x0 = x0.contiguous() 2025-05-07T20:33:23.5642516Z x1 = x1.contiguous() 2025-05-07T20:33:23.5642764Z 2025-05-07T20:33:23.5642959Z if scale_ub is not None: 2025-05-07T20:33:23.5643230Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.5643573Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.5643889Z ) 2025-05-07T20:33:23.5644079Z else: 2025-05-07T20:33:23.5644292Z scale_ub_tensor = None 2025-05-07T20:33:23.5644553Z 2025-05-07T20:33:23.5644780Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.5645100Z op = silu_mul_quant 2025-05-07T20:33:23.5645353Z if compiled: 2025-05-07T20:33:23.5645599Z op = torch.compile(op) 2025-05-07T20:33:23.5645900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.5646180Z 2025-05-07T20:33:23.5646373Z y_fp8, y_scale = fn() 2025-05-07T20:33:23.5646656Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:23.5646950Z 2025-05-07T20:33:23.5647185Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.5647524Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:23.5647822Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:23.5648142Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:23.5648503Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:23.5648820Z 2025-05-07T20:33:23.5649030Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:23.5649230Z 2025-05-07T20:33:23.5649333Z moe/activation_test.py:126: 2025-05-07T20:33:23.5649640Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.5650072Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:23.5650402Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:23.5651267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:23.5652071Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:23.5652705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:23.5653419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:23.5654188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:23.5655034Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:23.5655809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:23.5656483Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:23.5657126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:23.5657682Z fn() 2025-05-07T20:33:23.5658216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:23.5658845Z self.fn.run( 2025-05-07T20:33:23.5659385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:23.5659957Z kernel = self.compile( 2025-05-07T20:33:23.5660519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:23.5661209Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:23.5661624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.5661867Z 2025-05-07T20:33:23.5662090Z self = 2025-05-07T20:33:23.5663219Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:23.5664657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099b2c0cc0>} 2025-05-07T20:33:23.5666078Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:23.5667168Z context = 2025-05-07T20:33:23.5667471Z 2025-05-07T20:33:23.5667640Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:23.5668184Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:23.5668685Z module_map=module_map) 2025-05-07T20:33:23.5669067Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:23.5669439Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:23.5669710Z E ^ 2025-05-07T20:33:23.5670192Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.5670662Z 2025-05-07T20:33:23.5671102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.5671640Z 2025-05-07T20:33:23.5671748Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.5672164Z self=, 2025-05-07T20:33:23.5672677Z T=4096, 2025-05-07T20:33:23.5672867Z D=5120, 2025-05-07T20:33:23.5673060Z scale_ub=None, 2025-05-07T20:33:23.5673279Z contiguous=False, 2025-05-07T20:33:23.5673507Z compiled=False, 2025-05-07T20:33:23.5673703Z ) 2025-05-07T20:33:24.3802869Z self = 2025-05-07T20:33:24.3804002Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:24.3804591Z 2025-05-07T20:33:24.3804762Z @given( 2025-05-07T20:33:24.3805236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.3805982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.3806603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.3807271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.3807933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.3808494Z ) 2025-05-07T20:33:24.3809195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.3810105Z def test_silu_mul_quant( 2025-05-07T20:33:24.3810572Z self, 2025-05-07T20:33:24.3810955Z T: int, 2025-05-07T20:33:24.3811344Z D: int, 2025-05-07T20:33:24.3811764Z scale_ub: Optional[float], 2025-05-07T20:33:24.3812310Z contiguous: bool, 2025-05-07T20:33:24.3812603Z compiled: bool, 2025-05-07T20:33:24.3812850Z ) -> None: 2025-05-07T20:33:24.3813139Z torch.manual_seed(2025) 2025-05-07T20:33:24.3813388Z 2025-05-07T20:33:24.3813663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.3814018Z 2025-05-07T20:33:24.3814210Z x_sign = torch.sign(x) 2025-05-07T20:33:24.3814607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.3814921Z x = x_sign * x_clamp 2025-05-07T20:33:24.3815161Z x0 = x[:, :D] 2025-05-07T20:33:24.3815372Z x1 = x[:, D:] 2025-05-07T20:33:24.3815578Z 2025-05-07T20:33:24.3815763Z if contiguous: 2025-05-07T20:33:24.3815994Z x0 = x0.contiguous() 2025-05-07T20:33:24.3816250Z x1 = x1.contiguous() 2025-05-07T20:33:24.3816495Z 2025-05-07T20:33:24.3816687Z if scale_ub is not None: 2025-05-07T20:33:24.3816961Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.3817300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.3817617Z ) 2025-05-07T20:33:24.3817807Z else: 2025-05-07T20:33:24.3818025Z scale_ub_tensor = None 2025-05-07T20:33:24.3818283Z 2025-05-07T20:33:24.3818512Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.3818835Z op = silu_mul_quant 2025-05-07T20:33:24.3819088Z if compiled: 2025-05-07T20:33:24.3819331Z op = torch.compile(op) 2025-05-07T20:33:24.3819634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.3819915Z 2025-05-07T20:33:24.3820117Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.3820284Z 2025-05-07T20:33:24.3820383Z moe/activation_test.py:117: 2025-05-07T20:33:24.3820688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.3821028Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.3821311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.3822039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.3822773Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.3823333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.3824038Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.3824730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.3826010Z kernel = self.compile( 2025-05-07T20:33:24.3826643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.3827427Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.3827957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.3828225Z 2025-05-07T20:33:24.3828466Z self = 2025-05-07T20:33:24.3829784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.3831548Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03b7240>} 2025-05-07T20:33:24.3833211Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.3834467Z context = 2025-05-07T20:33:24.3834810Z 2025-05-07T20:33:24.3835001Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.3835664Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.3836154Z module_map=module_map) 2025-05-07T20:33:24.3836521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.3836875Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.3837141Z E ^ 2025-05-07T20:33:24.3837617Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.3838091Z 2025-05-07T20:33:24.3838532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.3839070Z 2025-05-07T20:33:24.3839171Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.3839594Z self=, 2025-05-07T20:33:24.3840012Z T=4096, 2025-05-07T20:33:24.3840193Z D=7168, 2025-05-07T20:33:24.3840386Z scale_ub=None, 2025-05-07T20:33:24.3840606Z contiguous=False, 2025-05-07T20:33:24.3840826Z compiled=False, 2025-05-07T20:33:24.3841034Z ) 2025-05-07T20:33:24.3841358Z self = 2025-05-07T20:33:24.3841868Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:24.3842150Z 2025-05-07T20:33:24.3842227Z @given( 2025-05-07T20:33:24.3842471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.3842821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.3843129Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.3843463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.3843798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.3844084Z ) 2025-05-07T20:33:24.3844441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.3844896Z def test_silu_mul_quant( 2025-05-07T20:33:24.3845134Z self, 2025-05-07T20:33:24.3845334Z T: int, 2025-05-07T20:33:24.3845528Z D: int, 2025-05-07T20:33:24.3845747Z scale_ub: Optional[float], 2025-05-07T20:33:24.3846017Z contiguous: bool, 2025-05-07T20:33:24.3846258Z compiled: bool, 2025-05-07T20:33:24.3846480Z ) -> None: 2025-05-07T20:33:24.3846693Z torch.manual_seed(2025) 2025-05-07T20:33:24.3846939Z 2025-05-07T20:33:24.3847212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.3847630Z 2025-05-07T20:33:24.3847824Z x_sign = torch.sign(x) 2025-05-07T20:33:24.3848113Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.3848419Z x = x_sign * x_clamp 2025-05-07T20:33:24.3848655Z x0 = x[:, :D] 2025-05-07T20:33:24.3848868Z x1 = x[:, D:] 2025-05-07T20:33:24.3849112Z 2025-05-07T20:33:24.3849302Z if contiguous: 2025-05-07T20:33:24.3849534Z x0 = x0.contiguous() 2025-05-07T20:33:24.3849790Z x1 = x1.contiguous() 2025-05-07T20:33:24.3850032Z 2025-05-07T20:33:24.3850263Z if scale_ub is not None: 2025-05-07T20:33:24.3850532Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.3850869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.3851184Z ) 2025-05-07T20:33:24.3851375Z else: 2025-05-07T20:33:24.3851578Z scale_ub_tensor = None 2025-05-07T20:33:24.3851831Z 2025-05-07T20:33:24.3852068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.3852388Z op = silu_mul_quant 2025-05-07T20:33:24.3852664Z if compiled: 2025-05-07T20:33:24.3852936Z op = torch.compile(op) 2025-05-07T20:33:24.3853231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.3853508Z 2025-05-07T20:33:24.3853706Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.3853870Z 2025-05-07T20:33:24.3853967Z moe/activation_test.py:117: 2025-05-07T20:33:24.3854341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.3854751Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.3855044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.3855761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.3856486Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.3857048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.3857762Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.3858457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.3859019Z kernel = self.compile( 2025-05-07T20:33:24.3859582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.3860264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.3860678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.3860912Z 2025-05-07T20:33:24.3861129Z self = 2025-05-07T20:33:24.3862254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.3863669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a879ee0>} 2025-05-07T20:33:24.3865074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.3866158Z context = 2025-05-07T20:33:24.3866457Z 2025-05-07T20:33:24.3866632Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.3867162Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.3867646Z module_map=module_map) 2025-05-07T20:33:24.3868061Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.3868420Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.3868679Z E ^ 2025-05-07T20:33:24.3869151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.3869661Z 2025-05-07T20:33:24.3870106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.3870647Z 2025-05-07T20:33:24.3870755Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.3871217Z self=, 2025-05-07T20:33:24.3871636Z T=128, 2025-05-07T20:33:24.3871828Z D=7168, 2025-05-07T20:33:24.3872011Z scale_ub=None, 2025-05-07T20:33:24.3872223Z contiguous=False, 2025-05-07T20:33:24.3872448Z compiled=True, 2025-05-07T20:33:24.3872669Z ) 2025-05-07T20:33:24.4430973Z self = 2025-05-07T20:33:24.4431509Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:24.4431836Z 2025-05-07T20:33:24.4431934Z @given( 2025-05-07T20:33:24.4432166Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.4432499Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.4432820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.4433167Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.4433592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.4441196Z ) 2025-05-07T20:33:24.4441591Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.4442056Z def test_silu_mul_quant( 2025-05-07T20:33:24.4442311Z self, 2025-05-07T20:33:24.4442515Z T: int, 2025-05-07T20:33:24.4442714Z D: int, 2025-05-07T20:33:24.4442948Z scale_ub: Optional[float], 2025-05-07T20:33:24.4443242Z contiguous: bool, 2025-05-07T20:33:24.4443485Z compiled: bool, 2025-05-07T20:33:24.4443719Z ) -> None: 2025-05-07T20:33:24.4443949Z torch.manual_seed(2025) 2025-05-07T20:33:24.4444201Z 2025-05-07T20:33:24.4444488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.4444855Z 2025-05-07T20:33:24.4445053Z x_sign = torch.sign(x) 2025-05-07T20:33:24.4445357Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.4445688Z x = x_sign * x_clamp 2025-05-07T20:33:24.4445931Z x0 = x[:, :D] 2025-05-07T20:33:24.4446152Z x1 = x[:, D:] 2025-05-07T20:33:24.4446366Z 2025-05-07T20:33:24.4446554Z if contiguous: 2025-05-07T20:33:24.4446798Z x0 = x0.contiguous() 2025-05-07T20:33:24.4447069Z x1 = x1.contiguous() 2025-05-07T20:33:24.4447325Z 2025-05-07T20:33:24.4447520Z if scale_ub is not None: 2025-05-07T20:33:24.4447807Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.4448162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.4448481Z ) 2025-05-07T20:33:24.4448687Z else: 2025-05-07T20:33:24.4448908Z scale_ub_tensor = None 2025-05-07T20:33:24.4449163Z 2025-05-07T20:33:24.4449405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.4449733Z op = silu_mul_quant 2025-05-07T20:33:24.4449985Z if compiled: 2025-05-07T20:33:24.4450243Z op = torch.compile(op) 2025-05-07T20:33:24.4450551Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.4450834Z 2025-05-07T20:33:24.4451035Z y_fp8, y_scale = fn() 2025-05-07T20:33:24.4451332Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:24.4451629Z 2025-05-07T20:33:24.4451871Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.4452221Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:24.4452637Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:24.4452990Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:24.4453375Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.4453702Z 2025-05-07T20:33:24.4453969Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:24.4454181Z 2025-05-07T20:33:24.4454287Z moe/activation_test.py:126: 2025-05-07T20:33:24.4454720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.4455060Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:24.4455470Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.4456300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:24.4457100Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:24.4457667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.4458390Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.4459125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:24.4459890Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:24.4460699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:24.4461471Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:24.4462192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:24.4462808Z fn() 2025-05-07T20:33:24.4463414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:24.4464117Z self.fn.run( 2025-05-07T20:33:24.4464667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.4465290Z kernel = self.compile( 2025-05-07T20:33:24.4465931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.4466716Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.4467175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.4467450Z 2025-05-07T20:33:24.4467687Z self = 2025-05-07T20:33:24.4469031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.4470740Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a7f9120>} 2025-05-07T20:33:24.4472408Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.4473668Z context = 2025-05-07T20:33:24.4474011Z 2025-05-07T20:33:24.4474198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.4474818Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.4475371Z module_map=module_map) 2025-05-07T20:33:24.4475784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.4476189Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:24.4476542Z E ^ 2025-05-07T20:33:24.4477085Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.4477640Z 2025-05-07T20:33:24.4478146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.4478819Z 2025-05-07T20:33:24.4478934Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.4479418Z self=, 2025-05-07T20:33:24.4479888Z T=128, 2025-05-07T20:33:24.4480133Z D=7168, 2025-05-07T20:33:24.4480345Z scale_ub=None, 2025-05-07T20:33:24.4480580Z contiguous=False, 2025-05-07T20:33:24.4480830Z compiled=False, 2025-05-07T20:33:24.4481056Z ) 2025-05-07T20:33:24.6467609Z self = 2025-05-07T20:33:24.6468163Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:24.6468470Z 2025-05-07T20:33:24.6468551Z @given( 2025-05-07T20:33:24.6468823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.6469148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.6469451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.6469790Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.6470132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.6470417Z ) 2025-05-07T20:33:24.6470881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.6471349Z def test_silu_mul_quant( 2025-05-07T20:33:24.6471595Z self, 2025-05-07T20:33:24.6471806Z T: int, 2025-05-07T20:33:24.6472015Z D: int, 2025-05-07T20:33:24.6472234Z scale_ub: Optional[float], 2025-05-07T20:33:24.6472506Z contiguous: bool, 2025-05-07T20:33:24.6472749Z compiled: bool, 2025-05-07T20:33:24.6472977Z ) -> None: 2025-05-07T20:33:24.6473190Z torch.manual_seed(2025) 2025-05-07T20:33:24.6473429Z 2025-05-07T20:33:24.6473711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.6474058Z 2025-05-07T20:33:24.6474255Z x_sign = torch.sign(x) 2025-05-07T20:33:24.6474550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.6474858Z x = x_sign * x_clamp 2025-05-07T20:33:24.6475099Z x0 = x[:, :D] 2025-05-07T20:33:24.6475308Z x1 = x[:, D:] 2025-05-07T20:33:24.6475509Z 2025-05-07T20:33:24.6475692Z if contiguous: 2025-05-07T20:33:24.6475929Z x0 = x0.contiguous() 2025-05-07T20:33:24.6476181Z x1 = x1.contiguous() 2025-05-07T20:33:24.6476431Z 2025-05-07T20:33:24.6476625Z if scale_ub is not None: 2025-05-07T20:33:24.6476902Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.6477242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.6477552Z ) 2025-05-07T20:33:24.6477744Z else: 2025-05-07T20:33:24.6477948Z scale_ub_tensor = None 2025-05-07T20:33:24.6478203Z 2025-05-07T20:33:24.6478433Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.6478746Z op = silu_mul_quant 2025-05-07T20:33:24.6479004Z if compiled: 2025-05-07T20:33:24.6479250Z op = torch.compile(op) 2025-05-07T20:33:24.6479545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6479821Z 2025-05-07T20:33:24.6480020Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.6480186Z 2025-05-07T20:33:24.6480284Z moe/activation_test.py:117: 2025-05-07T20:33:24.6480580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6480915Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.6481199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6481913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.6482712Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.6483268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.6484034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.6484759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.6485318Z kernel = self.compile( 2025-05-07T20:33:24.6485942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.6486629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.6487033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6487273Z 2025-05-07T20:33:24.6487485Z self = 2025-05-07T20:33:24.6488610Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.6490038Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4b8b80>} 2025-05-07T20:33:24.6491476Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.6492567Z context = 2025-05-07T20:33:24.6492875Z 2025-05-07T20:33:24.6493047Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.6493589Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.6494074Z module_map=module_map) 2025-05-07T20:33:24.6494517Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.6494885Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.6495155Z E ^ 2025-05-07T20:33:24.6495633Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.6496108Z 2025-05-07T20:33:24.6496546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.6497094Z 2025-05-07T20:33:24.6497209Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.6497635Z self=, 2025-05-07T20:33:24.6498050Z T=4096, 2025-05-07T20:33:24.6498244Z D=5120, 2025-05-07T20:33:24.6498441Z scale_ub=1200.0, 2025-05-07T20:33:24.6498664Z contiguous=True, 2025-05-07T20:33:24.6498895Z compiled=False, 2025-05-07T20:33:24.6499113Z ) 2025-05-07T20:33:24.6499438Z self = 2025-05-07T20:33:24.6499958Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:24.6500244Z 2025-05-07T20:33:24.6500329Z @given( 2025-05-07T20:33:24.6500556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.6500885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.6501199Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.6501533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.6501875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.6502171Z ) 2025-05-07T20:33:24.6502534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.6502996Z def test_silu_mul_quant( 2025-05-07T20:33:24.6503297Z self, 2025-05-07T20:33:24.6503493Z T: int, 2025-05-07T20:33:24.6503696Z D: int, 2025-05-07T20:33:24.6503912Z scale_ub: Optional[float], 2025-05-07T20:33:24.6504180Z contiguous: bool, 2025-05-07T20:33:24.6504426Z compiled: bool, 2025-05-07T20:33:24.6504638Z ) -> None: 2025-05-07T20:33:24.6504890Z torch.manual_seed(2025) 2025-05-07T20:33:24.6505135Z 2025-05-07T20:33:24.6505408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.6505768Z 2025-05-07T20:33:24.6505968Z x_sign = torch.sign(x) 2025-05-07T20:33:24.6506325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.6506647Z x = x_sign * x_clamp 2025-05-07T20:33:24.6506887Z x0 = x[:, :D] 2025-05-07T20:33:24.6507098Z x1 = x[:, D:] 2025-05-07T20:33:24.6507309Z 2025-05-07T20:33:24.6507502Z if contiguous: 2025-05-07T20:33:24.6507732Z x0 = x0.contiguous() 2025-05-07T20:33:24.6508001Z x1 = x1.contiguous() 2025-05-07T20:33:24.6508249Z 2025-05-07T20:33:24.6508438Z if scale_ub is not None: 2025-05-07T20:33:24.6508716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.6509059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.6509388Z ) 2025-05-07T20:33:24.6509584Z else: 2025-05-07T20:33:24.6509792Z scale_ub_tensor = None 2025-05-07T20:33:24.6510046Z 2025-05-07T20:33:24.6510313Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.6510640Z op = silu_mul_quant 2025-05-07T20:33:24.6510891Z if compiled: 2025-05-07T20:33:24.6511130Z op = torch.compile(op) 2025-05-07T20:33:24.6511424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6511700Z 2025-05-07T20:33:24.6511885Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.6512055Z 2025-05-07T20:33:24.6512151Z moe/activation_test.py:117: 2025-05-07T20:33:24.6512453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6512780Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.6513060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6513779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.6514500Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.6515051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.6515762Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.6516452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.6517004Z kernel = self.compile( 2025-05-07T20:33:24.6517552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.6518235Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.6518633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6518867Z 2025-05-07T20:33:24.6519075Z self = 2025-05-07T20:33:24.6520191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.6521607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4b9b20>} 2025-05-07T20:33:24.6523007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.6524137Z context = 2025-05-07T20:33:24.6524438Z 2025-05-07T20:33:24.6524610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.6525188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.6525847Z module_map=module_map) 2025-05-07T20:33:24.6526227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.6526662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.6526929Z E ^ 2025-05-07T20:33:24.6527406Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.6527878Z 2025-05-07T20:33:24.6528314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.6528864Z 2025-05-07T20:33:24.6528969Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.6529393Z self=, 2025-05-07T20:33:24.6529808Z T=1, 2025-05-07T20:33:24.6529991Z D=5120, 2025-05-07T20:33:24.6530189Z scale_ub=None, 2025-05-07T20:33:24.6530410Z contiguous=True, 2025-05-07T20:33:24.6530634Z compiled=True, 2025-05-07T20:33:24.6530840Z ) 2025-05-07T20:33:25.0445319Z self = 2025-05-07T20:33:25.0446710Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.0447258Z 2025-05-07T20:33:25.0447439Z @given( 2025-05-07T20:33:25.0447912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.0448568Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.0449201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.0449874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.0450560Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.0451142Z ) 2025-05-07T20:33:25.0451854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.0452598Z def test_silu_mul_quant( 2025-05-07T20:33:25.0452882Z self, 2025-05-07T20:33:25.0453090Z T: int, 2025-05-07T20:33:25.0453288Z D: int, 2025-05-07T20:33:25.0453515Z scale_ub: Optional[float], 2025-05-07T20:33:25.0453805Z contiguous: bool, 2025-05-07T20:33:25.0454056Z compiled: bool, 2025-05-07T20:33:25.0454283Z ) -> None: 2025-05-07T20:33:25.0454625Z torch.manual_seed(2025) 2025-05-07T20:33:25.0454864Z 2025-05-07T20:33:25.0455142Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.0455493Z 2025-05-07T20:33:25.0455684Z x_sign = torch.sign(x) 2025-05-07T20:33:25.0455978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.0456297Z x = x_sign * x_clamp 2025-05-07T20:33:25.0456536Z x0 = x[:, :D] 2025-05-07T20:33:25.0456759Z x1 = x[:, D:] 2025-05-07T20:33:25.0456969Z 2025-05-07T20:33:25.0457157Z if contiguous: 2025-05-07T20:33:25.0457399Z x0 = x0.contiguous() 2025-05-07T20:33:25.0457669Z x1 = x1.contiguous() 2025-05-07T20:33:25.0457924Z 2025-05-07T20:33:25.0458118Z if scale_ub is not None: 2025-05-07T20:33:25.0458407Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.0458753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.0459073Z ) 2025-05-07T20:33:25.0459269Z else: 2025-05-07T20:33:25.0459490Z scale_ub_tensor = None 2025-05-07T20:33:25.0459748Z 2025-05-07T20:33:25.0459985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.0460301Z op = silu_mul_quant 2025-05-07T20:33:25.0460547Z if compiled: 2025-05-07T20:33:25.0460886Z op = torch.compile(op) 2025-05-07T20:33:25.0461188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.0461464Z 2025-05-07T20:33:25.0461660Z y_fp8, y_scale = fn() 2025-05-07T20:33:25.0461951Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:25.0462324Z 2025-05-07T20:33:25.0462560Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.0462902Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:25.0463202Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:25.0463580Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:25.0463947Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.0464265Z 2025-05-07T20:33:25.0464458Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:25.0464664Z 2025-05-07T20:33:25.0464763Z moe/activation_test.py:126: 2025-05-07T20:33:25.0465061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.0465403Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:25.0465729Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.0466559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:25.0467354Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:25.0467986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.0468716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.0469446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:25.0470210Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.0470976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:25.0471658Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:25.0472292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:25.0472836Z fn() 2025-05-07T20:33:25.0473358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:25.0473984Z self.fn.run( 2025-05-07T20:33:25.0474473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.0475024Z kernel = self.compile( 2025-05-07T20:33:25.0475586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.0476270Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.0476671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.0476917Z 2025-05-07T20:33:25.0477128Z self = 2025-05-07T20:33:25.0478261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.0479699Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4baca0>} 2025-05-07T20:33:25.0481113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.0482192Z context = 2025-05-07T20:33:25.0482584Z 2025-05-07T20:33:25.0482778Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.0483320Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.0483928Z module_map=module_map) 2025-05-07T20:33:25.0484302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.0484676Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:25.0484956Z E ^ 2025-05-07T20:33:25.0485430Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.0485950Z 2025-05-07T20:33:25.0486391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.0486941Z 2025-05-07T20:33:25.0487046Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.0487474Z self=, 2025-05-07T20:33:25.0487899Z T=2048, 2025-05-07T20:33:25.0488100Z D=5120, 2025-05-07T20:33:25.0488307Z scale_ub=None, 2025-05-07T20:33:25.0488514Z contiguous=True, 2025-05-07T20:33:25.0488739Z compiled=True, 2025-05-07T20:33:25.0488946Z ) 2025-05-07T20:33:25.4246051Z self = 2025-05-07T20:33:25.4246630Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.4247131Z 2025-05-07T20:33:25.4247230Z @given( 2025-05-07T20:33:25.4247471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.4247801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.4248119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.4248468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.4248807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.4249105Z ) 2025-05-07T20:33:25.4249458Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.4249909Z def test_silu_mul_quant( 2025-05-07T20:33:25.4250155Z self, 2025-05-07T20:33:25.4250350Z T: int, 2025-05-07T20:33:25.4250548Z D: int, 2025-05-07T20:33:25.4250764Z scale_ub: Optional[float], 2025-05-07T20:33:25.4256994Z contiguous: bool, 2025-05-07T20:33:25.4257281Z compiled: bool, 2025-05-07T20:33:25.4257509Z ) -> None: 2025-05-07T20:33:25.4257731Z torch.manual_seed(2025) 2025-05-07T20:33:25.4257980Z 2025-05-07T20:33:25.4258249Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.4258601Z 2025-05-07T20:33:25.4258794Z x_sign = torch.sign(x) 2025-05-07T20:33:25.4259083Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.4259407Z x = x_sign * x_clamp 2025-05-07T20:33:25.4259648Z x0 = x[:, :D] 2025-05-07T20:33:25.4259861Z x1 = x[:, D:] 2025-05-07T20:33:25.4260067Z 2025-05-07T20:33:25.4260245Z if contiguous: 2025-05-07T20:33:25.4260468Z x0 = x0.contiguous() 2025-05-07T20:33:25.4260732Z x1 = x1.contiguous() 2025-05-07T20:33:25.4260972Z 2025-05-07T20:33:25.4261163Z if scale_ub is not None: 2025-05-07T20:33:25.4261436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.4261773Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.4262084Z ) 2025-05-07T20:33:25.4262267Z else: 2025-05-07T20:33:25.4262476Z scale_ub_tensor = None 2025-05-07T20:33:25.4262728Z 2025-05-07T20:33:25.4262952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.4263322Z op = silu_mul_quant 2025-05-07T20:33:25.4263575Z if compiled: 2025-05-07T20:33:25.4263818Z op = torch.compile(op) 2025-05-07T20:33:25.4264121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.4264513Z 2025-05-07T20:33:25.4264696Z y_fp8, y_scale = fn() 2025-05-07T20:33:25.4264988Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:25.4265285Z 2025-05-07T20:33:25.4265512Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.4265919Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:25.4266221Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:25.4266542Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:25.4266904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.4267279Z 2025-05-07T20:33:25.4267475Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:25.4267673Z 2025-05-07T20:33:25.4267773Z moe/activation_test.py:126: 2025-05-07T20:33:25.4268071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.4268411Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:25.4268738Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.4269561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:25.4270361Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:25.4270931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.4271680Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.4272401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:25.4273156Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.4273918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:25.4274580Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:25.4275204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:25.4275745Z fn() 2025-05-07T20:33:25.4276264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:25.4276880Z self.fn.run( 2025-05-07T20:33:25.4277363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.4277918Z kernel = self.compile( 2025-05-07T20:33:25.4278475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.4279157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.4279564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.4279801Z 2025-05-07T20:33:25.4280014Z self = 2025-05-07T20:33:25.4281134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.4282561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a565e40>} 2025-05-07T20:33:25.4283967Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.4285043Z context = 2025-05-07T20:33:25.4285346Z 2025-05-07T20:33:25.4285514Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.4286097Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.4286582Z module_map=module_map) 2025-05-07T20:33:25.4286956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.4287314Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:25.4287626Z E ^ 2025-05-07T20:33:25.4288104Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.4288578Z 2025-05-07T20:33:25.4289013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.4289605Z 2025-05-07T20:33:25.4289714Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.4290145Z self=, 2025-05-07T20:33:25.4290567Z T=128, 2025-05-07T20:33:25.4290753Z D=5120, 2025-05-07T20:33:25.4290953Z scale_ub=None, 2025-05-07T20:33:25.4291170Z contiguous=True, 2025-05-07T20:33:25.4291394Z compiled=True, 2025-05-07T20:33:25.4291599Z ) 2025-05-07T20:33:25.8700993Z self = 2025-05-07T20:33:25.8701582Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.8701864Z 2025-05-07T20:33:25.8701958Z @given( 2025-05-07T20:33:25.8702198Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.8702673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.8703014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.8703406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.8703754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.8704062Z ) 2025-05-07T20:33:25.8704422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.8704874Z def test_silu_mul_quant( 2025-05-07T20:33:25.8705125Z self, 2025-05-07T20:33:25.8705318Z T: int, 2025-05-07T20:33:25.8705510Z D: int, 2025-05-07T20:33:25.8705732Z scale_ub: Optional[float], 2025-05-07T20:33:25.8706007Z contiguous: bool, 2025-05-07T20:33:25.8706241Z compiled: bool, 2025-05-07T20:33:25.8706475Z ) -> None: 2025-05-07T20:33:25.8706692Z torch.manual_seed(2025) 2025-05-07T20:33:25.8706929Z 2025-05-07T20:33:25.8707215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.8707573Z 2025-05-07T20:33:25.8707774Z x_sign = torch.sign(x) 2025-05-07T20:33:25.8708061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.8708383Z x = x_sign * x_clamp 2025-05-07T20:33:25.8708632Z x0 = x[:, :D] 2025-05-07T20:33:25.8708846Z x1 = x[:, D:] 2025-05-07T20:33:25.8709064Z 2025-05-07T20:33:25.8709259Z if contiguous: 2025-05-07T20:33:25.8709492Z x0 = x0.contiguous() 2025-05-07T20:33:25.8709756Z x1 = x1.contiguous() 2025-05-07T20:33:25.8710013Z 2025-05-07T20:33:25.8710197Z if scale_ub is not None: 2025-05-07T20:33:25.8710477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.8710820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.8711134Z ) 2025-05-07T20:33:25.8711331Z else: 2025-05-07T20:33:25.8711546Z scale_ub_tensor = None 2025-05-07T20:33:25.8711800Z 2025-05-07T20:33:25.8712045Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.8712370Z op = silu_mul_quant 2025-05-07T20:33:25.8712615Z if compiled: 2025-05-07T20:33:25.8712869Z op = torch.compile(op) 2025-05-07T20:33:25.8713177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.8713506Z 2025-05-07T20:33:25.8713693Z y_fp8, y_scale = fn() 2025-05-07T20:33:25.8713978Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:25.8714359Z 2025-05-07T20:33:25.8714593Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.8714942Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:25.8715247Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:25.8715633Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:25.8716003Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.8716321Z 2025-05-07T20:33:25.8716520Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:25.8716729Z 2025-05-07T20:33:25.8716898Z moe/activation_test.py:126: 2025-05-07T20:33:25.8717208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.8717559Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:25.8717887Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.8718718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:25.8719520Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:25.8720080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.8720800Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.8721571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:25.8722336Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.8723111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:25.8723802Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:25.8724439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:25.8724992Z fn() 2025-05-07T20:33:25.8725676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:25.8726303Z self.fn.run( 2025-05-07T20:33:25.8726801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.8727355Z kernel = self.compile( 2025-05-07T20:33:25.8727930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.8728627Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.8729045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.8729292Z 2025-05-07T20:33:25.8729507Z self = 2025-05-07T20:33:25.8730645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.8732096Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e5902ac0>} 2025-05-07T20:33:25.8733512Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.8734669Z context = 2025-05-07T20:33:25.8734970Z 2025-05-07T20:33:25.8735140Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.8735685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.8736171Z module_map=module_map) 2025-05-07T20:33:25.8736617Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.8736988Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:25.8737273Z E ^ 2025-05-07T20:33:25.8737812Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:25.8738727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
[Hypothesis retries the identical test body for each example below; every retry fails at Triton compile time with the same ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the example parameters and the first kernel reached differ:]
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
-> CompilationError in _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant, activation.py:80)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant:
2025-05-07T20:33:27.2708842Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:27.2709206Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:27.2709472Z E ^
2025-05-07T20:33:27.2709950Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.2710419Z 2025-05-07T20:33:27.2710865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.2711407Z 2025-05-07T20:33:27.2711518Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.2711941Z self=, 2025-05-07T20:33:27.2712361Z T=1, 2025-05-07T20:33:27.2712550Z D=7168, 2025-05-07T20:33:27.2712740Z scale_ub=1200.0, 2025-05-07T20:33:27.2712973Z contiguous=True, 2025-05-07T20:33:27.2713199Z compiled=True, 2025-05-07T20:33:27.2713426Z ) 2025-05-07T20:33:27.2713772Z self = 2025-05-07T20:33:27.2714332Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:27.2714606Z 2025-05-07T20:33:27.2714682Z @given( 2025-05-07T20:33:27.2714918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:27.2715279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:27.2715593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:27.2715925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:27.2716267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:27.2716607Z ) 2025-05-07T20:33:27.2716964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:27.2717424Z def test_silu_mul_quant( 2025-05-07T20:33:27.2717674Z self, 2025-05-07T20:33:27.2717866Z T: int, 2025-05-07T20:33:27.2718060Z D: int, 2025-05-07T20:33:27.2718274Z scale_ub: Optional[float], 2025-05-07T20:33:27.2718546Z contiguous: bool, 2025-05-07T20:33:27.2718788Z compiled: bool, 2025-05-07T20:33:27.2719007Z ) -> None: 2025-05-07T20:33:27.2719210Z torch.manual_seed(2025) 2025-05-07T20:33:27.2719446Z 2025-05-07T20:33:27.2719722Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:27.2720069Z 2025-05-07T20:33:27.2720262Z x_sign = torch.sign(x) 2025-05-07T20:33:27.2720547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:27.2720900Z x = x_sign * x_clamp 2025-05-07T20:33:27.2721133Z x0 = x[:, :D] 2025-05-07T20:33:27.2721352Z x1 = x[:, D:] 2025-05-07T20:33:27.2721561Z 2025-05-07T20:33:27.2721737Z if contiguous: 2025-05-07T20:33:27.2721966Z x0 = x0.contiguous() 2025-05-07T20:33:27.2722223Z x1 = x1.contiguous() 2025-05-07T20:33:27.2722461Z 2025-05-07T20:33:27.2722647Z if scale_ub is not None: 2025-05-07T20:33:27.2722920Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:27.2723267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:27.2723611Z ) 2025-05-07T20:33:27.2723798Z else: 2025-05-07T20:33:27.2724001Z scale_ub_tensor = None 2025-05-07T20:33:27.2724254Z 2025-05-07T20:33:27.2724490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:27.2724802Z op = silu_mul_quant 2025-05-07T20:33:27.2725054Z if compiled: 2025-05-07T20:33:27.2725300Z op = torch.compile(op) 2025-05-07T20:33:27.2725786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2726065Z 2025-05-07T20:33:27.2726252Z > y_fp8, y_scale = fn() 2025-05-07T20:33:27.2726415Z 2025-05-07T20:33:27.2726517Z moe/activation_test.py:117: 2025-05-07T20:33:27.2726809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2727141Z moe/activation_test.py:115: in fn 2025-05-07T20:33:27.2727421Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2727992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:27.2728571Z return fn(*args, **kwargs) 
2025-05-07T20:33:27.2729259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:27.2729985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:27.2730540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:27.2731254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:27.2731946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:27.2732497Z kernel = self.compile( 2025-05-07T20:33:27.2733055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:27.2733815Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:27.2734226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2734532Z 2025-05-07T20:33:27.2734804Z self = 2025-05-07T20:33:27.2735934Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:27.2737416Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e50c2ac0>} 2025-05-07T20:33:27.2738825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:27.2739913Z context = 2025-05-07T20:33:27.2740210Z 2025-05-07T20:33:27.2740379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:27.2740924Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:27.2741416Z module_map=module_map) 2025-05-07T20:33:27.2741840Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.2742213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.2742480Z E ^ 2025-05-07T20:33:27.2742966Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:27.2743875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:27.2744526Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
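All of these failures share one root cause, visible in the ValueError itself: Triton only lowers the fp8e4nv dtype (the E4M3 FP8 variant behind torch.float8_e4m3fn) on GPUs of compute capability 8.9 or newer, while the A10G in a linux.g5.4xlarge runner reports capability (8, 6), leaving only fp8e4b15 and fp8e5 available. A minimal sketch of a capability gate, assuming only stock PyTorch (the helper name is illustrative, not FBGEMM API):

import torch

def device_supports_fp8e4nv() -> bool:
    # fp8e4nv compiles only on Ada/Hopper-class GPUs (compute capability >= 8.9);
    # the A10G here reports (8, 6), which is why every compile attempt fails.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)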
2025-05-07T20:33:27.4133850Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True). This example gets past fn(); the failure moves to the reference path (duplicate test source omitted):
2025-05-07T20:33:27.5019049Z y_fp8, y_scale = fn()
2025-05-07T20:33:27.5019341Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:27.5019875Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:27.5020222Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:27.5020599Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:27.5020920Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:27.5021295Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:27.5021955Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:27.5022260Z moe/activation_test.py:126:
2025-05-07T20:33:27.5022568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:27.5022965Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:27.5023299Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:27.5024126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:27.5024929Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:27.5027129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:27.5027892Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:27.5028725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:27.5029409Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:27.5030044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:27.5030586Z fn()
2025-05-07T20:33:27.5031117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:27.5031748Z self.fn.run(
2025-05-07T20:33:27.5032240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:27.5032798Z kernel = self.compile(
2025-05-07T20:33:27.5033372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:27.5034069Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:27.5041998Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:27.5042372Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:27.5042717Z E ^
2025-05-07T20:33:27.5043187Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:27.5044162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
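The contract of triton_quantize_fp8_row can be read off the test itself, since it dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None]: the kernel returns an FP8 tensor plus one dequantization scale per row, with scale_ub optionally capping the per-row maximum. A rough PyTorch restatement, assuming torch.float8_e4m3fn with 448 as its largest finite value; this is an illustrative sketch, not the Triton kernel:

import torch

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row dequantization scale, optionally capped by scale_ub (as in the test).
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_E4M3_MAX
    # Scale each row into FP8 range and cast; dequant is y_fp8.float() * y_scale[:, None].
    y_fp8 = (y / y_scale[:, None]).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return y_fp8, y_scale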
2025-05-07T20:33:27.5044821Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
2025-05-07T20:33:27.6637525Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
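Because the failure is a property of the runner's GPU rather than of any particular input, Hypothesis keeps rediscovering it on every drawn example. One way to fail fast, sketched here with stock unittest rather than the repo's own decorators, is to skip on unsupported hardware:

import unittest
import torch

def _has_fp8_gpu() -> bool:
    # True only on GPUs where Triton can lower fp8e4nv (Ada/Hopper and newer).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _has_fp8_gpu(), "fp8e4nv needs compute capability >= 8.9")
class SiluMulQuantSkipDemo(unittest.TestCase):
    def test_placeholder(self) -> None:
        # Placeholder body; the real suite would run test_silu_mul_quant here.
        self.assertTrue(_has_fp8_gpu())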
2025-05-07T20:33:27.6668977Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
2025-05-07T20:33:27.7575758Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
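The operation under test is small enough to restate exactly from ref_fn above: SiLU of the first half of the input, multiplied elementwise by the second half, computed in fp32 before quantization. A self-contained restatement for reference:

import torch

def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1 in fp32, matching ref_fn: x0 * sigmoid(x0) * x1.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32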
2025-05-07T20:33:27.8768132Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
2025-05-07T20:33:27.8835070Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
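To replay one of these parameter sets locally, Hypothesis's @example decorator pins a case so it is tried before any random draws. A standalone sketch with illustrative values (the real test would decorate test_silu_mul_quant and keep its full @given signature):

from hypothesis import example, given, settings, strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=4096)  # one of the failing sizes from this log, always tried first
@settings(max_examples=10, deadline=None)
def check_shapes(T: int) -> None:
    # Placeholder body; the real test would build inputs of shape [T, 2 * D]
    # and call silu_mul_quant as above.
    assert T >= 1

check_shapes()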
2025-05-07T20:33:28.0617648Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
2025-05-07T20:33:28.0617648Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:28.0618069Z     self=<...>,
2025-05-07T20:33:28.0618488Z     T=4096,
2025-05-07T20:33:28.0618678Z     D=5120,
2025-05-07T20:33:28.0618864Z     scale_ub=None,
2025-05-07T20:33:28.0619076Z     contiguous=False,
2025-05-07T20:33:28.0619294Z     compiled=True,
2025-05-07T20:33:28.0619485Z )
2025-05-07T20:33:28.0619807Z self = <...>
2025-05-07T20:33:28.0620332Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:28.0620615Z 
2025-05-07T20:33:28.0620690Z     @given(
2025-05-07T20:33:28.0620911Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:28.0621221Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:28.0621528Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:28.0621859Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:28.0622181Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:28.0622470Z     )
2025-05-07T20:33:28.0622818Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:28.0623265Z     def test_silu_mul_quant(
2025-05-07T20:33:28.0623504Z         self,
2025-05-07T20:33:28.0623693Z         T: int,
2025-05-07T20:33:28.0623881Z         D: int,
2025-05-07T20:33:28.0624093Z         scale_ub: Optional[float],
2025-05-07T20:33:28.0624365Z         contiguous: bool,
2025-05-07T20:33:28.0624597Z         compiled: bool,
2025-05-07T20:33:28.0624807Z     ) -> None:
2025-05-07T20:33:28.0625013Z         torch.manual_seed(2025)
2025-05-07T20:33:28.0625247Z 
2025-05-07T20:33:28.0625694Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:28.0626043Z 
2025-05-07T20:33:28.0626230Z         x_sign = torch.sign(x)
2025-05-07T20:33:28.0626514Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:28.0626821Z         x = x_sign * x_clamp
2025-05-07T20:33:28.0627058Z         x0 = x[:, :D]
2025-05-07T20:33:28.0627259Z         x1 = x[:, D:]
2025-05-07T20:33:28.0627462Z 
2025-05-07T20:33:28.0627642Z         if contiguous:
2025-05-07T20:33:28.0627866Z             x0 = x0.contiguous()
2025-05-07T20:33:28.0628118Z             x1 = x1.contiguous()
2025-05-07T20:33:28.0628357Z 
2025-05-07T20:33:28.0628538Z         if scale_ub is not None:
2025-05-07T20:33:28.0628877Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:28.0629210Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:28.0629521Z             )
2025-05-07T20:33:28.0629704Z         else:
2025-05-07T20:33:28.0629910Z             scale_ub_tensor = None
2025-05-07T20:33:28.0630158Z 
2025-05-07T20:33:28.0630437Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:28.0630757Z             op = silu_mul_quant
2025-05-07T20:33:28.0631011Z             if compiled:
2025-05-07T20:33:28.0631255Z                 op = torch.compile(op)
2025-05-07T20:33:28.0631642Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:28.0631927Z 
2025-05-07T20:33:28.0632113Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:28.0632281Z 
2025-05-07T20:33:28.0632381Z moe/activation_test.py:117: 
2025-05-07T20:33:28.0632679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:28.0633014Z moe/activation_test.py:115: in fn
2025-05-07T20:33:28.0639979Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:28.0640612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:28.0641204Z     return fn(*args, **kwargs)
2025-05-07T20:33:28.0641904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:28.0642635Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:28.0643296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:28.0644025Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:28.0644725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:28.0645292Z     kernel = self.compile(
2025-05-07T20:33:28.0645863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:28.0646568Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:28.0646982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:28.0647224Z 
2025-05-07T20:33:28.0647449Z self = <...>
2025-05-07T20:33:28.0648594Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:28.0650048Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f08e46a91c0>}
2025-05-07T20:33:28.0651469Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:28.0652569Z context = <...>
2025-05-07T20:33:28.0652872Z 
2025-05-07T20:33:28.0653046Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:28.0653600Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:28.0654093Z                            module_map=module_map)
2025-05-07T20:33:28.0654527Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:28.0654896Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:28.0655176Z E       ^
2025-05-07T20:33:28.0655662Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:28.0656140Z 
2025-05-07T20:33:28.0656583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
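For context on what the failing test exercises: silu_mul_quant fuses SiLU(x0) * x1 with quantization to FP8. An eager-mode reference for that contract, inferred from the test body above, might look like the following sketch; the amax-based scaling and the clamping details are assumptions, not the FBGEMM kernel's exact scheme:

    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    # Largest finite value representable in float8_e4m3fn (448.0).
    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Compute SiLU(x0) * x1 in float32, then quantize symmetrically to fp8
        # with a tensor-wide scale, optionally capped by scale_ub.
        y = F.silu(x0.float()) * x1.float()
        amax = y.abs().amax().clamp(min=1e-12)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float())
        scale = amax / FP8_E4M3_MAX
        y_fp8 = (y / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale

Such a baseline lets y_fp8.float() * scale be compared against the bf16 product under a loose tolerance, which is presumably what the assertions after the fn() call do.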
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.0656140Z 2025-05-07T20:33:28.0656583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.0657190Z 2025-05-07T20:33:28.4011926Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.4012601Z self=, 2025-05-07T20:33:28.4013208Z T=4096, 2025-05-07T20:33:28.4013474Z D=5120, 2025-05-07T20:33:28.4013935Z scale_ub=1200.0, 2025-05-07T20:33:28.4014258Z contiguous=False, 2025-05-07T20:33:28.4014573Z compiled=False, 2025-05-07T20:33:28.4014782Z ) 2025-05-07T20:33:28.4015112Z self = 2025-05-07T20:33:28.4015715Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.4016006Z 2025-05-07T20:33:28.4016087Z @given( 2025-05-07T20:33:28.4016318Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.4016642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.4016948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.4017291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.4017631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.4017923Z ) 2025-05-07T20:33:28.4018276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.4018737Z def test_silu_mul_quant( 2025-05-07T20:33:28.4018987Z self, 2025-05-07T20:33:28.4019178Z T: int, 2025-05-07T20:33:28.4019373Z D: int, 2025-05-07T20:33:28.4019661Z scale_ub: Optional[float], 2025-05-07T20:33:28.4019941Z contiguous: bool, 2025-05-07T20:33:28.4020191Z compiled: bool, 2025-05-07T20:33:28.4020424Z ) -> None: 2025-05-07T20:33:28.4020633Z torch.manual_seed(2025) 2025-05-07T20:33:28.4020875Z 2025-05-07T20:33:28.4021151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.4021501Z 2025-05-07T20:33:28.4021696Z x_sign = torch.sign(x) 2025-05-07T20:33:28.4021998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.4022308Z x = x_sign * x_clamp 2025-05-07T20:33:28.4022549Z x0 = x[:, :D] 2025-05-07T20:33:28.4022762Z x1 = x[:, D:] 2025-05-07T20:33:28.4022975Z 2025-05-07T20:33:28.4023161Z if contiguous: 2025-05-07T20:33:28.4023401Z x0 = x0.contiguous() 2025-05-07T20:33:28.4023673Z x1 = x1.contiguous() 2025-05-07T20:33:28.4023937Z 2025-05-07T20:33:28.4024160Z if scale_ub is not None: 2025-05-07T20:33:28.4024439Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.4024780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.4025098Z ) 2025-05-07T20:33:28.4025295Z else: 2025-05-07T20:33:28.4025680Z scale_ub_tensor = None 2025-05-07T20:33:28.4025942Z 2025-05-07T20:33:28.4026181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.4026497Z op = silu_mul_quant 2025-05-07T20:33:28.4026757Z if compiled: 2025-05-07T20:33:28.4027035Z op = torch.compile(op) 2025-05-07T20:33:28.4027368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.4027733Z 2025-05-07T20:33:28.4027934Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.4028114Z 2025-05-07T20:33:28.4028220Z moe/activation_test.py:117: 2025-05-07T20:33:28.4028613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4028965Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.4029302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.4030108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.4030897Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.4031516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.4032434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.4033208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.4033894Z kernel = self.compile( 2025-05-07T20:33:28.4034632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.4035423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.4035833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4036236Z 2025-05-07T20:33:28.4036455Z self = 2025-05-07T20:33:28.4037680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.4039266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46aa160>} 2025-05-07T20:33:28.4040847Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.4042166Z context = 2025-05-07T20:33:28.4042494Z 2025-05-07T20:33:28.4042669Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.4043300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.4043865Z module_map=module_map) 2025-05-07T20:33:28.4044238Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.4044680Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.4044944Z E ^ 2025-05-07T20:33:28.4045493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.4045979Z 2025-05-07T20:33:28.4046495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.4047103Z 2025-05-07T20:33:28.4047230Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.4047667Z self=, 2025-05-07T20:33:28.4048153Z T=4096, 2025-05-07T20:33:28.4048355Z D=5120, 2025-05-07T20:33:28.4048559Z scale_ub=1200.0, 2025-05-07T20:33:28.4048840Z contiguous=False, 2025-05-07T20:33:28.4049084Z compiled=True, 2025-05-07T20:33:28.4049300Z ) 2025-05-07T20:33:28.4049675Z self = 2025-05-07T20:33:28.4050242Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.4050573Z 2025-05-07T20:33:28.4050653Z @given( 2025-05-07T20:33:28.4050897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.4051282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.4051611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.4052024Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.4052358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.4052671Z ) 2025-05-07T20:33:28.4053087Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.4053572Z def test_silu_mul_quant( 2025-05-07T20:33:28.4053870Z self, 2025-05-07T20:33:28.4054081Z T: int, 2025-05-07T20:33:28.4054284Z D: int, 2025-05-07T20:33:28.4054709Z scale_ub: Optional[float], 2025-05-07T20:33:28.4054992Z contiguous: bool, 2025-05-07T20:33:28.4055233Z compiled: bool, 2025-05-07T20:33:28.4055655Z ) -> None: 2025-05-07T20:33:28.4055882Z torch.manual_seed(2025) 2025-05-07T20:33:28.4056145Z 2025-05-07T20:33:28.4056426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.4056786Z 2025-05-07T20:33:28.4056988Z x_sign = torch.sign(x) 2025-05-07T20:33:28.4057328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.4057657Z x = x_sign * x_clamp 2025-05-07T20:33:28.4057909Z x0 = x[:, :D] 2025-05-07T20:33:28.4058125Z x1 = x[:, D:] 2025-05-07T20:33:28.4058390Z 2025-05-07T20:33:28.4058577Z if contiguous: 2025-05-07T20:33:28.4058810Z x0 = x0.contiguous() 2025-05-07T20:33:28.4059078Z x1 = x1.contiguous() 2025-05-07T20:33:28.4059324Z 2025-05-07T20:33:28.4059510Z if scale_ub is not None: 2025-05-07T20:33:28.4059791Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.4060144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.4060463Z ) 2025-05-07T20:33:28.4060664Z else: 2025-05-07T20:33:28.4060877Z scale_ub_tensor = None 2025-05-07T20:33:28.4061138Z 2025-05-07T20:33:28.4061374Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.4061702Z op = silu_mul_quant 2025-05-07T20:33:28.4061957Z if compiled: 2025-05-07T20:33:28.4062202Z op = torch.compile(op) 2025-05-07T20:33:28.4062557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.4062853Z 2025-05-07T20:33:28.4063047Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.4063223Z 2025-05-07T20:33:28.4063324Z moe/activation_test.py:117: 2025-05-07T20:33:28.4063634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4063972Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.4064266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.4064856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.4065451Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.4066137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.4066872Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.4067437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.4068166Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.4068869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.4069432Z kernel = self.compile( 2025-05-07T20:33:28.4069992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.4070681Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.4071097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4071341Z 2025-05-07T20:33:28.4071551Z self = 2025-05-07T20:33:28.4072687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.4074172Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46ab240>} 2025-05-07T20:33:28.4075581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.4076722Z context = 2025-05-07T20:33:28.4077024Z 2025-05-07T20:33:28.4077198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.4077815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.4078313Z module_map=module_map) 2025-05-07T20:33:28.4078694Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.4079057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.4079373Z E ^ 2025-05-07T20:33:28.4079862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.4080339Z 2025-05-07T20:33:28.4080778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.4081331Z 2025-05-07T20:33:28.5233687Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.5234413Z self=, 2025-05-07T20:33:28.5235032Z T=2048, 2025-05-07T20:33:28.5235297Z D=7168, 2025-05-07T20:33:28.5235550Z scale_ub=1200.0, 2025-05-07T20:33:28.5235787Z contiguous=False, 2025-05-07T20:33:28.5236024Z compiled=False, 2025-05-07T20:33:28.5236242Z ) 2025-05-07T20:33:28.5236578Z self = 2025-05-07T20:33:28.5237260Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.5237561Z 2025-05-07T20:33:28.5237640Z @given( 2025-05-07T20:33:28.5237874Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.5238184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.5238496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.5238834Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.5239172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.5239457Z ) 2025-05-07T20:33:28.5239818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.5240278Z def test_silu_mul_quant( 2025-05-07T20:33:28.5240521Z self, 2025-05-07T20:33:28.5240719Z T: int, 2025-05-07T20:33:28.5240925Z D: int, 2025-05-07T20:33:28.5241141Z scale_ub: Optional[float], 2025-05-07T20:33:28.5241421Z contiguous: bool, 2025-05-07T20:33:28.5241667Z compiled: bool, 2025-05-07T20:33:28.5241895Z ) -> None: 2025-05-07T20:33:28.5242126Z torch.manual_seed(2025) 2025-05-07T20:33:28.5242372Z 2025-05-07T20:33:28.5242646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.5243006Z 2025-05-07T20:33:28.5243209Z x_sign = torch.sign(x) 2025-05-07T20:33:28.5243504Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.5243825Z x = x_sign * x_clamp 2025-05-07T20:33:28.5244074Z x0 = x[:, :D] 2025-05-07T20:33:28.5244296Z x1 = x[:, D:] 2025-05-07T20:33:28.5244502Z 2025-05-07T20:33:28.5244694Z if contiguous: 2025-05-07T20:33:28.5244933Z x0 = x0.contiguous() 2025-05-07T20:33:28.5245382Z x1 = x1.contiguous() 2025-05-07T20:33:28.5245639Z 2025-05-07T20:33:28.5245839Z if scale_ub is not None: 2025-05-07T20:33:28.5246114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.5246457Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.5246782Z ) 2025-05-07T20:33:28.5246972Z else: 2025-05-07T20:33:28.5247191Z scale_ub_tensor = None 2025-05-07T20:33:28.5247451Z 2025-05-07T20:33:28.5247680Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.5248009Z op = silu_mul_quant 2025-05-07T20:33:28.5248266Z if compiled: 2025-05-07T20:33:28.5248511Z op = torch.compile(op) 2025-05-07T20:33:28.5248904Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.5249192Z 2025-05-07T20:33:28.5249393Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.5249566Z 2025-05-07T20:33:28.5249672Z moe/activation_test.py:117: 2025-05-07T20:33:28.5250050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.5250399Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.5250688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.5251422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.5252230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.5252790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.5253517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.5254223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.5254904Z kernel = self.compile( 2025-05-07T20:33:28.5255469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.5256179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.5256596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.5256884Z 2025-05-07T20:33:28.5257110Z self = 2025-05-07T20:33:28.5258234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.5259685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a4220>} 2025-05-07T20:33:28.5261116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.5262210Z context = 2025-05-07T20:33:28.5262514Z 2025-05-07T20:33:28.5262699Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.5263247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.5263748Z module_map=module_map) 2025-05-07T20:33:28.5264159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.5264537Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.5264810Z E ^ 2025-05-07T20:33:28.5265300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.5265779Z 2025-05-07T20:33:28.5266227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.5266778Z 2025-05-07T20:33:28.5266895Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.5267326Z self=, 2025-05-07T20:33:28.5267756Z T=1, 2025-05-07T20:33:28.5267958Z D=7168, 2025-05-07T20:33:28.5268156Z scale_ub=None, 2025-05-07T20:33:28.5268390Z contiguous=True, 2025-05-07T20:33:28.5268626Z compiled=False, 2025-05-07T20:33:28.5268835Z ) 2025-05-07T20:33:28.5269173Z self = 2025-05-07T20:33:28.5269687Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.5269961Z 2025-05-07T20:33:28.5270047Z @given( 2025-05-07T20:33:28.5270338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.5270674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.5271000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.5271337Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.5271726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.5272026Z ) 2025-05-07T20:33:28.5272388Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.5272856Z def test_silu_mul_quant( 2025-05-07T20:33:28.5273154Z self, 2025-05-07T20:33:28.5273355Z T: int, 2025-05-07T20:33:28.5273563Z D: int, 2025-05-07T20:33:28.5273796Z scale_ub: Optional[float], 2025-05-07T20:33:28.5274082Z contiguous: bool, 2025-05-07T20:33:28.5274337Z compiled: bool, 2025-05-07T20:33:28.5274572Z ) -> None: 2025-05-07T20:33:28.5274790Z torch.manual_seed(2025) 2025-05-07T20:33:28.5275044Z 2025-05-07T20:33:28.5275331Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.5275690Z 2025-05-07T20:33:28.5275881Z x_sign = torch.sign(x) 2025-05-07T20:33:28.5276180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.5276509Z x = x_sign * x_clamp 2025-05-07T20:33:28.5276747Z x0 = x[:, :D] 2025-05-07T20:33:28.5276975Z x1 = x[:, D:] 2025-05-07T20:33:28.5277192Z 2025-05-07T20:33:28.5277425Z if contiguous: 2025-05-07T20:33:28.5277666Z x0 = x0.contiguous() 2025-05-07T20:33:28.5277939Z x1 = x1.contiguous() 2025-05-07T20:33:28.5278179Z 2025-05-07T20:33:28.5278379Z if scale_ub is not None: 2025-05-07T20:33:28.5278661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.5278997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.5279312Z ) 2025-05-07T20:33:28.5279510Z else: 2025-05-07T20:33:28.5279725Z scale_ub_tensor = None 2025-05-07T20:33:28.5279990Z 2025-05-07T20:33:28.5280233Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.5280557Z op = silu_mul_quant 2025-05-07T20:33:28.5280804Z if compiled: 2025-05-07T20:33:28.5281061Z op = torch.compile(op) 2025-05-07T20:33:28.5281367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.5281648Z 2025-05-07T20:33:28.5281846Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.5282016Z 2025-05-07T20:33:28.5282120Z moe/activation_test.py:117: 2025-05-07T20:33:28.5282416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.5282758Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.5283051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.5283768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.5284500Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.5285065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.5285788Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.5286484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.5287040Z kernel = self.compile( 2025-05-07T20:33:28.5287606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.5288299Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.5288704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.5288947Z 2025-05-07T20:33:28.5289157Z self = 2025-05-07T20:33:28.5290281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.5291799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a5120>} 2025-05-07T20:33:28.5293216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.5294378Z context = 2025-05-07T20:33:28.5294764Z 2025-05-07T20:33:28.5294937Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.5295484Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.5295969Z module_map=module_map) 2025-05-07T20:33:28.5296345Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.5296715Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.5296978Z E ^ 2025-05-07T20:33:28.5297467Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.5297944Z 2025-05-07T20:33:28.5298425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.5298973Z 2025-05-07T20:33:28.5299089Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.5299516Z self=, 2025-05-07T20:33:28.5299938Z T=16384, 2025-05-07T20:33:28.5300139Z D=7168, 2025-05-07T20:33:28.5300328Z scale_ub=1200.0, 2025-05-07T20:33:28.5300556Z contiguous=False, 2025-05-07T20:33:28.5300784Z compiled=True, 2025-05-07T20:33:28.7709819Z ) 2025-05-07T20:33:28.7710641Z self = 2025-05-07T20:33:28.7711453Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.7711869Z 2025-05-07T20:33:28.7711986Z @given( 2025-05-07T20:33:28.7712277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7712604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7712935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7713290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7713632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7713938Z ) 2025-05-07T20:33:28.7714304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7714766Z def test_silu_mul_quant( 2025-05-07T20:33:28.7715023Z self, 2025-05-07T20:33:28.7715237Z T: int, 2025-05-07T20:33:28.7715440Z D: int, 2025-05-07T20:33:28.7715675Z scale_ub: Optional[float], 2025-05-07T20:33:28.7715963Z contiguous: bool, 2025-05-07T20:33:28.7716215Z compiled: bool, 2025-05-07T20:33:28.7716444Z ) -> None: 2025-05-07T20:33:28.7716675Z torch.manual_seed(2025) 2025-05-07T20:33:28.7716933Z 2025-05-07T20:33:28.7717217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7717582Z 2025-05-07T20:33:28.7717796Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7718094Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7718428Z x = x_sign * x_clamp 2025-05-07T20:33:28.7718680Z x0 = x[:, :D] 2025-05-07T20:33:28.7718904Z x1 = x[:, D:] 2025-05-07T20:33:28.7719128Z 2025-05-07T20:33:28.7719333Z if contiguous: 2025-05-07T20:33:28.7719579Z x0 = x0.contiguous() 2025-05-07T20:33:28.7719855Z x1 = x1.contiguous() 2025-05-07T20:33:28.7720237Z 2025-05-07T20:33:28.7720431Z if scale_ub is not None: 2025-05-07T20:33:28.7720718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7721067Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7721383Z ) 2025-05-07T20:33:28.7721587Z else: 2025-05-07T20:33:28.7721873Z scale_ub_tensor = None 2025-05-07T20:33:28.7722131Z 2025-05-07T20:33:28.7722369Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7722707Z op = silu_mul_quant 2025-05-07T20:33:28.7723034Z if compiled: 2025-05-07T20:33:28.7723284Z op = torch.compile(op) 2025-05-07T20:33:28.7723606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7723900Z 2025-05-07T20:33:28.7724099Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7724282Z 2025-05-07T20:33:28.7724391Z moe/activation_test.py:117: 2025-05-07T20:33:28.7724702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7725048Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7725347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7726239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7726844Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7727618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7728357Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7728932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7729654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7730365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7730943Z kernel = self.compile( 2025-05-07T20:33:28.7731521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7732215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7732647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7732888Z 2025-05-07T20:33:28.7733119Z self = 2025-05-07T20:33:28.7734262Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7735815Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a6520>} 2025-05-07T20:33:28.7737236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7738334Z context = 2025-05-07T20:33:28.7738637Z 2025-05-07T20:33:28.7738828Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7739377Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7739879Z module_map=module_map) 2025-05-07T20:33:28.7740273Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7740654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7740932Z E ^ 2025-05-07T20:33:28.7741428Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7741906Z 2025-05-07T20:33:28.7742425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7742980Z 2025-05-07T20:33:28.7743102Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7743602Z self=, 2025-05-07T20:33:28.7744038Z T=1, 2025-05-07T20:33:28.7744242Z D=7168, 2025-05-07T20:33:28.7744446Z scale_ub=None, 2025-05-07T20:33:28.7744680Z contiguous=False, 2025-05-07T20:33:28.7744931Z compiled=False, 2025-05-07T20:33:28.7745207Z ) 2025-05-07T20:33:28.7745550Z self = 2025-05-07T20:33:28.7746070Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.7746348Z 2025-05-07T20:33:28.7746434Z @given( 2025-05-07T20:33:28.7746679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7747021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7747358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7747709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7748067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7748385Z ) 2025-05-07T20:33:28.7748756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7749236Z def test_silu_mul_quant( 2025-05-07T20:33:28.7749500Z self, 2025-05-07T20:33:28.7749761Z T: int, 2025-05-07T20:33:28.7749987Z D: int, 2025-05-07T20:33:28.7750230Z scale_ub: Optional[float], 2025-05-07T20:33:28.7750507Z contiguous: bool, 2025-05-07T20:33:28.7750762Z compiled: bool, 2025-05-07T20:33:28.7751002Z ) -> None: 2025-05-07T20:33:28.7751223Z torch.manual_seed(2025) 2025-05-07T20:33:28.7751476Z 2025-05-07T20:33:28.7751773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7752152Z 2025-05-07T20:33:28.7752354Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7752668Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7752999Z x = x_sign * x_clamp 2025-05-07T20:33:28.7753253Z x0 = x[:, :D] 2025-05-07T20:33:28.7753490Z x1 = x[:, D:] 2025-05-07T20:33:28.7753712Z 2025-05-07T20:33:28.7753896Z if contiguous: 2025-05-07T20:33:28.7754141Z x0 = x0.contiguous() 2025-05-07T20:33:28.7754414Z x1 = x1.contiguous() 2025-05-07T20:33:28.7754663Z 2025-05-07T20:33:28.7754866Z if scale_ub is not None: 2025-05-07T20:33:28.7755152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7755491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7755817Z ) 2025-05-07T20:33:28.7756019Z else: 2025-05-07T20:33:28.7756229Z scale_ub_tensor = None 2025-05-07T20:33:28.7756492Z 2025-05-07T20:33:28.7756733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7757060Z op = silu_mul_quant 2025-05-07T20:33:28.7757307Z if compiled: 2025-05-07T20:33:28.7757559Z op = torch.compile(op) 2025-05-07T20:33:28.7757864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7758137Z 2025-05-07T20:33:28.7758334Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7758498Z 2025-05-07T20:33:28.7758606Z moe/activation_test.py:117: 2025-05-07T20:33:28.7758901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7759249Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7759536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7760259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7760986Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7761550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7762327Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7763024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7763628Z kernel = self.compile( 2025-05-07T20:33:28.7764205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7764899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7765349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7765595Z 2025-05-07T20:33:28.7765811Z self = 2025-05-07T20:33:28.7766937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7768367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a7100>} 2025-05-07T20:33:28.7769830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7770919Z context = 2025-05-07T20:33:28.7771222Z 2025-05-07T20:33:28.7771392Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7771938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7772420Z module_map=module_map) 2025-05-07T20:33:28.7772793Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7773157Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7773419Z E ^ 2025-05-07T20:33:28.7773907Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7774383Z 2025-05-07T20:33:28.7774911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7775458Z 2025-05-07T20:33:28.7775571Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7776003Z self=, 2025-05-07T20:33:28.7776425Z T=2048, 2025-05-07T20:33:28.7776619Z D=7168, 2025-05-07T20:33:28.7776814Z scale_ub=None, 2025-05-07T20:33:28.7777038Z contiguous=False, 2025-05-07T20:33:28.7777269Z compiled=True, 2025-05-07T20:33:28.7777481Z ) 2025-05-07T20:33:28.8651628Z self = 2025-05-07T20:33:28.8652531Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.8652955Z 2025-05-07T20:33:28.8653068Z @given( 2025-05-07T20:33:28.8653316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.8653646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.8654001Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.8654340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.8654741Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.8655031Z ) 2025-05-07T20:33:28.8655392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.8655853Z def test_silu_mul_quant( 2025-05-07T20:33:28.8656094Z self, 2025-05-07T20:33:28.8656299Z T: int, 2025-05-07T20:33:28.8656504Z D: int, 2025-05-07T20:33:28.8656730Z scale_ub: Optional[float], 2025-05-07T20:33:28.8657119Z contiguous: bool, 2025-05-07T20:33:28.8657368Z compiled: bool, 2025-05-07T20:33:28.8657596Z ) -> None: 2025-05-07T20:33:28.8657807Z torch.manual_seed(2025) 2025-05-07T20:33:28.8658043Z 2025-05-07T20:33:28.8658323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.8658771Z 2025-05-07T20:33:28.8658964Z x_sign = torch.sign(x) 2025-05-07T20:33:28.8659257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.8659577Z x = x_sign * x_clamp 2025-05-07T20:33:28.8659813Z x0 = x[:, :D] 2025-05-07T20:33:28.8660093Z x1 = x[:, D:] 2025-05-07T20:33:28.8660299Z 2025-05-07T20:33:28.8660493Z if contiguous: 2025-05-07T20:33:28.8660728Z x0 = x0.contiguous() 2025-05-07T20:33:28.8660992Z x1 = x1.contiguous() 2025-05-07T20:33:28.8661247Z 2025-05-07T20:33:28.8661434Z if scale_ub is not None: 2025-05-07T20:33:28.8661711Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.8662049Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.8662358Z ) 2025-05-07T20:33:28.8662545Z else: 2025-05-07T20:33:28.8662755Z scale_ub_tensor = None 2025-05-07T20:33:28.8663011Z 2025-05-07T20:33:28.8663246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.8663564Z op = silu_mul_quant 2025-05-07T20:33:28.8663812Z if compiled: 2025-05-07T20:33:28.8664125Z op = torch.compile(op) 2025-05-07T20:33:28.8664429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.8664704Z 2025-05-07T20:33:28.8664898Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.8665070Z 2025-05-07T20:33:28.8665168Z moe/activation_test.py:117: 2025-05-07T20:33:28.8665466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.8665798Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.8666085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.8666670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.8667255Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.8667946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.8668670Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.8669236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.8669942Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.8670639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.8671196Z kernel = self.compile( 2025-05-07T20:33:28.8671750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.8672438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.8672840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.8673075Z 2025-05-07T20:33:28.8673294Z self = 2025-05-07T20:33:28.8674404Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.8675833Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4704720>} 2025-05-07T20:33:28.8677236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.8678370Z context = 2025-05-07T20:33:28.8678669Z 2025-05-07T20:33:28.8678841Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.8679412Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.8679894Z module_map=module_map) 2025-05-07T20:33:28.8680272Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.8680665Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.8680926Z E ^ 2025-05-07T20:33:28.8681402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.8681871Z 2025-05-07T20:33:28.8682313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.8682857Z 2025-05-07T20:33:28.8682959Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.8683377Z self=, 2025-05-07T20:33:28.8683791Z T=4096, 2025-05-07T20:33:28.8683971Z D=7168, 2025-05-07T20:33:28.8684165Z scale_ub=None, 2025-05-07T20:33:28.8684388Z contiguous=False, 2025-05-07T20:33:28.8684608Z compiled=True, 2025-05-07T20:33:28.8684808Z ) 2025-05-07T20:33:28.8685180Z self = 2025-05-07T20:33:28.8685694Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.8685974Z 2025-05-07T20:33:28.8686054Z @given( 2025-05-07T20:33:28.8686285Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.8686605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.8686908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.8687244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.8687579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.8687865Z ) 2025-05-07T20:33:28.8688219Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.8688672Z def test_silu_mul_quant( 2025-05-07T20:33:28.8688917Z self, 2025-05-07T20:33:28.8689103Z T: int, 2025-05-07T20:33:28.8689298Z D: int, 2025-05-07T20:33:28.8689527Z scale_ub: Optional[float], 2025-05-07T20:33:28.8689798Z contiguous: bool, 2025-05-07T20:33:28.8690044Z compiled: bool, 2025-05-07T20:33:28.8690265Z ) -> None: 2025-05-07T20:33:28.8690473Z torch.manual_seed(2025) 2025-05-07T20:33:28.8690718Z 2025-05-07T20:33:28.8690991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.8691337Z 2025-05-07T20:33:28.8691527Z x_sign = torch.sign(x) 2025-05-07T20:33:28.8691817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.8692128Z x = x_sign * x_clamp 2025-05-07T20:33:28.8692368Z x0 = x[:, :D] 2025-05-07T20:33:28.8692584Z x1 = x[:, D:] 2025-05-07T20:33:28.8692790Z 2025-05-07T20:33:28.8692975Z if contiguous: 2025-05-07T20:33:28.8693212Z x0 = x0.contiguous() 2025-05-07T20:33:28.8693468Z x1 = x1.contiguous() 2025-05-07T20:33:28.8693737Z 2025-05-07T20:33:28.8693954Z if scale_ub is not None: 2025-05-07T20:33:28.8694233Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.8694643Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.8694960Z ) 2025-05-07T20:33:28.8695149Z else: 2025-05-07T20:33:28.8695352Z scale_ub_tensor = None 2025-05-07T20:33:28.8695604Z 2025-05-07T20:33:28.8695833Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.8696144Z op = silu_mul_quant 2025-05-07T20:33:28.8696394Z if compiled: 2025-05-07T20:33:28.8696692Z op = torch.compile(op) 2025-05-07T20:33:28.8696992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.8697270Z 2025-05-07T20:33:28.8697454Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.8697621Z 2025-05-07T20:33:28.8697719Z moe/activation_test.py:117: 2025-05-07T20:33:28.8698064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.8698399Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.8698682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.8699293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.8699875Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.8700557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.8701268Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.8701833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.8702553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.8703252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.8703843Z kernel = self.compile( 2025-05-07T20:33:28.8704464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.8705155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.8705562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.8705799Z 2025-05-07T20:33:28.8706008Z self = 2025-05-07T20:33:28.8707130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.8708564Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4705440>} 2025-05-07T20:33:28.8709970Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.8711044Z context = 2025-05-07T20:33:28.8711346Z 2025-05-07T20:33:28.8711511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.8712048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.8712530Z module_map=module_map) 2025-05-07T20:33:28.8712903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.8713264Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.8713529Z E ^ 2025-05-07T20:33:28.8714005Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.8714532Z 2025-05-07T20:33:28.8714968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.8715516Z 2025-05-07T20:33:29.0314954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.0316225Z self=, 2025-05-07T20:33:29.0317360Z T=16384, 2025-05-07T20:33:29.0317855Z D=5120, 2025-05-07T20:33:29.0318229Z scale_ub=1200.0, 2025-05-07T20:33:29.0318669Z contiguous=False, 2025-05-07T20:33:29.0319108Z compiled=False, 2025-05-07T20:33:29.0319490Z ) 2025-05-07T20:33:29.0320336Z self = 2025-05-07T20:33:29.0321357Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:29.0321934Z 2025-05-07T20:33:29.0322090Z @given( 2025-05-07T20:33:29.0322646Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.0323278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.0323885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.0324350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.0324771Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.0325063Z ) 2025-05-07T20:33:29.0325593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.0326057Z def test_silu_mul_quant( 2025-05-07T20:33:29.0326298Z self, 2025-05-07T20:33:29.0326488Z T: int, 2025-05-07T20:33:29.0326685Z D: int, 2025-05-07T20:33:29.0326907Z scale_ub: Optional[float], 2025-05-07T20:33:29.0327183Z contiguous: bool, 2025-05-07T20:33:29.0327421Z compiled: bool, 2025-05-07T20:33:29.0327646Z ) -> None: 2025-05-07T20:33:29.0327860Z torch.manual_seed(2025) 2025-05-07T20:33:29.0328095Z 2025-05-07T20:33:29.0328373Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.0328720Z 2025-05-07T20:33:29.0328906Z x_sign = torch.sign(x) 2025-05-07T20:33:29.0329262Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.0329579Z x = x_sign * x_clamp 2025-05-07T20:33:29.0329814Z x0 = x[:, :D] 2025-05-07T20:33:29.0330028Z x1 = x[:, D:] 2025-05-07T20:33:29.0330233Z 2025-05-07T20:33:29.0330417Z if contiguous: 2025-05-07T20:33:29.0330653Z x0 = x0.contiguous() 2025-05-07T20:33:29.0330915Z x1 = x1.contiguous() 2025-05-07T20:33:29.0331154Z 2025-05-07T20:33:29.0331344Z if scale_ub is not None: 2025-05-07T20:33:29.0331619Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.0331948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.0332261Z ) 2025-05-07T20:33:29.0332449Z else: 2025-05-07T20:33:29.0332661Z scale_ub_tensor = None 2025-05-07T20:33:29.0332913Z 2025-05-07T20:33:29.0333149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.0333463Z op = silu_mul_quant 2025-05-07T20:33:29.0333715Z if compiled: 2025-05-07T20:33:29.0333960Z op = torch.compile(op) 2025-05-07T20:33:29.0341433Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.0341747Z 2025-05-07T20:33:29.0341942Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.0342112Z 2025-05-07T20:33:29.0342217Z moe/activation_test.py:117: 2025-05-07T20:33:29.0342511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.0342854Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.0343137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.0343853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:29.0344573Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.0345129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.0345842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.0346530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.0347077Z kernel = self.compile( 2025-05-07T20:33:29.0347639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.0348326Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.0348842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.0349082Z 2025-05-07T20:33:29.0349290Z self = 2025-05-07T20:33:29.0350475Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.0351905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4706340>} 2025-05-07T20:33:29.0353374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.0354450Z context = 2025-05-07T20:33:29.0354753Z 2025-05-07T20:33:29.0354925Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.0355460Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.0355948Z module_map=module_map) 2025-05-07T20:33:29.0356308Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.0356708Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.0356970Z E ^ 2025-05-07T20:33:29.0357440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.0357917Z 2025-05-07T20:33:29.0358355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.0358900Z 2025-05-07T20:33:29.0359002Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.0359428Z self=, 2025-05-07T20:33:29.0359840Z T=16384, 2025-05-07T20:33:29.0360033Z D=5120, 2025-05-07T20:33:29.0360221Z scale_ub=1200.0, 2025-05-07T20:33:29.0360435Z contiguous=True, 2025-05-07T20:33:29.0360651Z compiled=True, 2025-05-07T20:33:29.0360849Z ) 2025-05-07T20:33:29.0361161Z self = 2025-05-07T20:33:29.0361671Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:29.0361953Z 2025-05-07T20:33:29.0362041Z @given( 2025-05-07T20:33:29.0362262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.0362575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.0362882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.0363208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.0363536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.0363830Z ) 2025-05-07T20:33:29.0364180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.0364627Z def test_silu_mul_quant( 2025-05-07T20:33:29.0364866Z self, 2025-05-07T20:33:29.0365059Z T: int, 2025-05-07T20:33:29.0365248Z D: int, 2025-05-07T20:33:29.0365464Z scale_ub: Optional[float], 2025-05-07T20:33:29.0365733Z contiguous: bool, 2025-05-07T20:33:29.0365962Z compiled: bool, 2025-05-07T20:33:29.0366183Z ) -> None: 2025-05-07T20:33:29.0366389Z torch.manual_seed(2025) 2025-05-07T20:33:29.0366628Z 2025-05-07T20:33:29.0366898Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.0367250Z 2025-05-07T20:33:29.0367435Z x_sign = torch.sign(x) 2025-05-07T20:33:29.0367728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.0368044Z x = x_sign * x_clamp 2025-05-07T20:33:29.0368281Z x0 = x[:, :D] 2025-05-07T20:33:29.0368543Z x1 = x[:, D:] 2025-05-07T20:33:29.0368749Z 2025-05-07T20:33:29.0368927Z if contiguous: 2025-05-07T20:33:29.0369149Z x0 = x0.contiguous() 2025-05-07T20:33:29.0369410Z x1 = x1.contiguous() 2025-05-07T20:33:29.0369653Z 2025-05-07T20:33:29.0369882Z if scale_ub is not None: 2025-05-07T20:33:29.0370157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.0370499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.0370802Z ) 2025-05-07T20:33:29.0371029Z else: 2025-05-07T20:33:29.0371238Z scale_ub_tensor = None 2025-05-07T20:33:29.0371489Z 2025-05-07T20:33:29.0371723Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.0372040Z op = silu_mul_quant 2025-05-07T20:33:29.0372284Z if compiled: 2025-05-07T20:33:29.0372527Z op = torch.compile(op) 2025-05-07T20:33:29.0372822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.0373100Z 2025-05-07T20:33:29.0373284Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.0373453Z 2025-05-07T20:33:29.0373546Z moe/activation_test.py:117: 2025-05-07T20:33:29.0373846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.0374187Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.0374611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.0375230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.0375812Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.0376491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.0377213Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.0377764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.0378472Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.0379161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.0379715Z kernel = self.compile( 2025-05-07T20:33:29.0380276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.0380955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.0381361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.0381604Z 2025-05-07T20:33:29.0381814Z self = 2025-05-07T20:33:29.0382930Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.0384378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e47079c0>} 2025-05-07T20:33:29.0385807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.0386891Z context = 2025-05-07T20:33:29.0387187Z 2025-05-07T20:33:29.0387360Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.0387893Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.0388377Z module_map=module_map) 2025-05-07T20:33:29.0388743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.0389148Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.0389408Z E ^ 2025-05-07T20:33:29.0389887Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.0390357Z 2025-05-07T20:33:29.0390836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.0391377Z
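Every failure above is the same architecture mismatch: fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA GPUs only support natively from compute capability 8.9 (Ada/Hopper) onward, while the A10G on this g5 runner is SM 8.6 and therefore only exposes fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip these examples instead of crashing inside the Triton compiler; the supports_fp8e4nv/requires_fp8e4nv names and the plain unittest decorator are illustrative assumptions, not FBGEMM's actual test plumbing:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) needs an NVIDIA GPU with compute
        # capability >= 8.9 (Ada/Hopper); an A10G reports (8, 6),
        # so this returns False on this runner.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator, not FBGEMM's actual gating: skip the
    # test on GPUs where the kernel cannot compile.
    requires_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )

Applied as @requires_fp8e4nv on test_silu_mul_quant, the whole example set would be reported as skipped on this runner rather than as a string of identical CompilationErrors.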
2025-05-07T20:33:29.2111194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.2111918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.2112480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.2113197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.2113960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.2114516Z kernel = self.compile( 2025-05-07T20:33:29.2115075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.2115767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.2116181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.2116426Z 2025-05-07T20:33:29.2116641Z self = 2025-05-07T20:33:29.2117815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.2119255Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e5484c20>} 2025-05-07T20:33:29.2120667Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.2121753Z context = 2025-05-07T20:33:29.2122059Z 2025-05-07T20:33:29.2122230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.2122774Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.2123260Z module_map=module_map) 2025-05-07T20:33:29.2123633Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.2123997Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.2124267Z E ^ 2025-05-07T20:33:29.2124743Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.2125218Z 2025-05-07T20:33:29.2125843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.2126393Z 2025-05-07T20:33:29.2126502Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.2126932Z self=, 2025-05-07T20:33:29.2127345Z T=2048, 2025-05-07T20:33:29.2127535Z D=5120, 2025-05-07T20:33:29.2127729Z scale_ub=None, 2025-05-07T20:33:29.2127943Z contiguous=False, 2025-05-07T20:33:29.2128174Z compiled=True, 2025-05-07T20:33:29.2128375Z ) 2025-05-07T20:33:29.3032305Z self = 2025-05-07T20:33:29.3033851Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:29.3034355Z 2025-05-07T20:33:29.3034464Z @given( 2025-05-07T20:33:29.3034745Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.3035068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.3035377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.3035715Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.3036153Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.3036438Z ) 2025-05-07T20:33:29.3036783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.3037232Z def test_silu_mul_quant( 2025-05-07T20:33:29.3037472Z self, 2025-05-07T20:33:29.3037727Z T: int, 2025-05-07T20:33:29.3037933Z D: int, 2025-05-07T20:33:29.3038156Z scale_ub: Optional[float], 2025-05-07T20:33:29.3038427Z contiguous: bool, 2025-05-07T20:33:29.3038673Z compiled: bool, 2025-05-07T20:33:29.3038958Z ) -> None: 2025-05-07T20:33:29.3039166Z torch.manual_seed(2025) 2025-05-07T20:33:29.3039409Z 2025-05-07T20:33:29.3039680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.3040022Z 2025-05-07T20:33:29.3040211Z x_sign = torch.sign(x) 2025-05-07T20:33:29.3040503Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.3040817Z x = x_sign * x_clamp 2025-05-07T20:33:29.3041047Z x0 = x[:, :D] 2025-05-07T20:33:29.3041255Z x1 = x[:, D:] 2025-05-07T20:33:29.3041460Z 2025-05-07T20:33:29.3041633Z if contiguous: 2025-05-07T20:33:29.3041863Z x0 = x0.contiguous() 2025-05-07T20:33:29.3042120Z x1 = x1.contiguous() 2025-05-07T20:33:29.3042359Z 2025-05-07T20:33:29.3042544Z if scale_ub is not None: 2025-05-07T20:33:29.3042813Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.3043210Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.3043528Z ) 2025-05-07T20:33:29.3043721Z else: 2025-05-07T20:33:29.3043924Z scale_ub_tensor = None 2025-05-07T20:33:29.3044177Z 2025-05-07T20:33:29.3044404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.3044715Z op = silu_mul_quant 2025-05-07T20:33:29.3044968Z if compiled: 2025-05-07T20:33:29.3045215Z op = torch.compile(op) 2025-05-07T20:33:29.3045511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.3045788Z 2025-05-07T20:33:29.3045972Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.3046136Z 2025-05-07T20:33:29.3046238Z moe/activation_test.py:117: 2025-05-07T20:33:29.3046534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.3046876Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.3047156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.3047730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.3048315Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.3048997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.3049717Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.3050267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.3050978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.3051673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.3052224Z kernel = self.compile( 2025-05-07T20:33:29.3052786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.3053474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.3053883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.3054121Z 2025-05-07T20:33:29.3054331Z self = 2025-05-07T20:33:29.3055571Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.3057055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e54859e0>} 2025-05-07T20:33:29.3058505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.3059595Z context = 2025-05-07T20:33:29.3059933Z 2025-05-07T20:33:29.3060103Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.3060644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.3061125Z module_map=module_map) 2025-05-07T20:33:29.3061496Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.3061859Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.3062120Z E ^ 2025-05-07T20:33:29.3062605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.3063081Z 2025-05-07T20:33:29.3063520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.3064116Z 2025-05-07T20:33:29.3064262Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.3064691Z self=, 2025-05-07T20:33:29.3065103Z T=2048, 2025-05-07T20:33:29.3065292Z D=5120, 2025-05-07T20:33:29.3065482Z scale_ub=1200.0, 2025-05-07T20:33:29.3065700Z contiguous=False, 2025-05-07T20:33:29.3065920Z compiled=True, 2025-05-07T20:33:29.3066114Z ) 2025-05-07T20:33:29.3066430Z self = 2025-05-07T20:33:29.3066936Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:29.3067220Z 2025-05-07T20:33:29.3067295Z @given( 2025-05-07T20:33:29.3067517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.3067830Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.3068140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.3068474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.3068801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.3069094Z ) 2025-05-07T20:33:29.3069441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.3069891Z def test_silu_mul_quant( 2025-05-07T20:33:29.3070121Z self, 2025-05-07T20:33:29.3070312Z T: int, 2025-05-07T20:33:29.3070506Z D: int, 2025-05-07T20:33:29.3070716Z scale_ub: Optional[float], 2025-05-07T20:33:29.3070989Z contiguous: bool, 2025-05-07T20:33:29.3071223Z compiled: bool, 2025-05-07T20:33:29.3071434Z ) -> None: 2025-05-07T20:33:29.3071645Z torch.manual_seed(2025) 2025-05-07T20:33:29.3071885Z 2025-05-07T20:33:29.3072156Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.3072509Z 2025-05-07T20:33:29.3072699Z x_sign = torch.sign(x) 2025-05-07T20:33:29.3072981Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.3073301Z x = x_sign * x_clamp 2025-05-07T20:33:29.3073541Z x0 = x[:, :D] 2025-05-07T20:33:29.3073747Z x1 = x[:, D:] 2025-05-07T20:33:29.3073950Z 2025-05-07T20:33:29.3074127Z if contiguous: 2025-05-07T20:33:29.3074355Z x0 = x0.contiguous() 2025-05-07T20:33:29.3074608Z x1 = x1.contiguous() 2025-05-07T20:33:29.3074852Z 2025-05-07T20:33:29.3075039Z if scale_ub is not None: 2025-05-07T20:33:29.3075305Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.3075698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.3076015Z ) 2025-05-07T20:33:29.3076201Z else: 2025-05-07T20:33:29.3076405Z scale_ub_tensor = None 2025-05-07T20:33:29.3076659Z 2025-05-07T20:33:29.3076922Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.3077247Z op = silu_mul_quant 2025-05-07T20:33:29.3077506Z if compiled: 2025-05-07T20:33:29.3077755Z op = torch.compile(op) 2025-05-07T20:33:29.3078099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.3078383Z 2025-05-07T20:33:29.3078576Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.3078747Z 2025-05-07T20:33:29.3078849Z moe/activation_test.py:117: 2025-05-07T20:33:29.3079148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.3079495Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.3079781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.3080366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.3080956Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.3081646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.3082374Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.3082981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.3083706Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.3084400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.3084963Z kernel = self.compile( 2025-05-07T20:33:29.3085530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.3086222Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.3086635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.3086877Z 2025-05-07T20:33:29.3087089Z self = 2025-05-07T20:33:29.3088212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.3089650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e5486b60>} 2025-05-07T20:33:29.3091052Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.3092143Z context = 2025-05-07T20:33:29.3092451Z 2025-05-07T20:33:29.3092621Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.3093165Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.3093645Z module_map=module_map) 2025-05-07T20:33:29.3094068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.3094503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.3094763Z E ^ 2025-05-07T20:33:29.3095242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.3095723Z 2025-05-07T20:33:29.3096162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.3096807Z 2025-05-07T20:33:29.4845870Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.4846529Z self=, 2025-05-07T20:33:29.4847096Z T=4096, 2025-05-07T20:33:29.4847351Z D=5120, 2025-05-07T20:33:29.4847757Z scale_ub=1200.0, 2025-05-07T20:33:29.4847977Z contiguous=True, 2025-05-07T20:33:29.4854833Z compiled=True, 2025-05-07T20:33:29.4855072Z ) 2025-05-07T20:33:29.4855419Z self = 2025-05-07T20:33:29.4856060Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:29.4856346Z 2025-05-07T20:33:29.4856433Z @given( 2025-05-07T20:33:29.4856678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.4856997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.4857304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.4857636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.4857967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.4858266Z ) 2025-05-07T20:33:29.4858614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.4859068Z def test_silu_mul_quant( 2025-05-07T20:33:29.4859311Z self, 2025-05-07T20:33:29.4859504Z T: int, 2025-05-07T20:33:29.4859700Z D: int, 2025-05-07T20:33:29.4859986Z scale_ub: Optional[float], 2025-05-07T20:33:29.4860258Z contiguous: bool, 2025-05-07T20:33:29.4860504Z compiled: bool, 2025-05-07T20:33:29.4860728Z ) -> None: 2025-05-07T20:33:29.4860945Z torch.manual_seed(2025) 2025-05-07T20:33:29.4861200Z 2025-05-07T20:33:29.4861484Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.4861848Z 2025-05-07T20:33:29.4862050Z x_sign = torch.sign(x) 2025-05-07T20:33:29.4862348Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.4862676Z x = x_sign * x_clamp 2025-05-07T20:33:29.4862920Z x0 = x[:, :D] 2025-05-07T20:33:29.4863143Z x1 = x[:, D:] 2025-05-07T20:33:29.4863362Z 2025-05-07T20:33:29.4863554Z if contiguous: 2025-05-07T20:33:29.4863797Z x0 = x0.contiguous() 2025-05-07T20:33:29.4864069Z x1 = x1.contiguous() 2025-05-07T20:33:29.4864314Z 2025-05-07T20:33:29.4864514Z if scale_ub is not None: 2025-05-07T20:33:29.4864804Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.4865144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.4865463Z ) 2025-05-07T20:33:29.4865664Z else: 2025-05-07T20:33:29.4865902Z scale_ub_tensor = None 2025-05-07T20:33:29.4866163Z 2025-05-07T20:33:29.4866401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.4866716Z op = silu_mul_quant 2025-05-07T20:33:29.4866962Z if compiled: 2025-05-07T20:33:29.4867204Z op = torch.compile(op) 2025-05-07T20:33:29.4867501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.4867774Z 2025-05-07T20:33:29.4867962Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.4868124Z 2025-05-07T20:33:29.4868227Z moe/activation_test.py:117: 2025-05-07T20:33:29.4868522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.4868853Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.4869136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.4869715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.4870292Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.4870971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.4871688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.4872309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.4873012Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.4873736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.4874295Z kernel = self.compile( 2025-05-07T20:33:29.4874851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.4875577Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.4875985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.4876221Z 2025-05-07T20:33:29.4876437Z self = 2025-05-07T20:33:29.4877557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.4878989Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4590180>} 2025-05-07T20:33:29.4880432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.4881522Z context = 2025-05-07T20:33:29.4881824Z 2025-05-07T20:33:29.4882004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.4882543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.4883028Z module_map=module_map) 2025-05-07T20:33:29.4883403Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.4883763Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.4884029Z E ^ 2025-05-07T20:33:29.4884549Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.4885028Z 2025-05-07T20:33:29.4885471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.4886016Z 2025-05-07T20:33:29.4886124Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.4886555Z self=, 2025-05-07T20:33:29.4886978Z T=128, 2025-05-07T20:33:29.4887167Z D=5120, 2025-05-07T20:33:29.4887362Z scale_ub=1200.0, 2025-05-07T20:33:29.4887588Z contiguous=False, 2025-05-07T20:33:29.4887813Z compiled=True, 2025-05-07T20:33:29.4888021Z ) 2025-05-07T20:33:29.7737263Z self = 2025-05-07T20:33:29.7737827Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:29.7738154Z 2025-05-07T20:33:29.7738265Z @given( 2025-05-07T20:33:29.7738600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.7739025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.7739422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.7739833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.7740174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.7740458Z ) 2025-05-07T20:33:29.7740810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.7741265Z def test_silu_mul_quant( 2025-05-07T20:33:29.7741542Z self, 2025-05-07T20:33:29.7741739Z T: int, 2025-05-07T20:33:29.7741934Z D: int, 2025-05-07T20:33:29.7742319Z scale_ub: Optional[float], 2025-05-07T20:33:29.7742599Z contiguous: bool, 2025-05-07T20:33:29.7742838Z compiled: bool, 2025-05-07T20:33:29.7743064Z ) -> None: 2025-05-07T20:33:29.7743285Z torch.manual_seed(2025) 2025-05-07T20:33:29.7743528Z 2025-05-07T20:33:29.7743883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.7744243Z 2025-05-07T20:33:29.7744439Z x_sign = torch.sign(x) 2025-05-07T20:33:29.7744733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.7745121Z x = x_sign * x_clamp 2025-05-07T20:33:29.7745363Z x0 = x[:, :D] 2025-05-07T20:33:29.7745573Z x1 = x[:, D:] 2025-05-07T20:33:29.7745788Z 2025-05-07T20:33:29.7745978Z if contiguous: 2025-05-07T20:33:29.7746209Z x0 = x0.contiguous() 2025-05-07T20:33:29.7746470Z x1 = x1.contiguous() 2025-05-07T20:33:29.7746719Z 2025-05-07T20:33:29.7746906Z if scale_ub is not None: 2025-05-07T20:33:29.7747190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.7747538Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.7747848Z ) 2025-05-07T20:33:29.7748044Z else: 2025-05-07T20:33:29.7748267Z scale_ub_tensor = None 2025-05-07T20:33:29.7748518Z 2025-05-07T20:33:29.7748752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.7749080Z op = silu_mul_quant 2025-05-07T20:33:29.7749423Z if compiled: 2025-05-07T20:33:29.7749677Z op = torch.compile(op) 2025-05-07T20:33:29.7749981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.7750268Z 2025-05-07T20:33:29.7750455Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.7750630Z 2025-05-07T20:33:29.7750729Z moe/activation_test.py:117: 2025-05-07T20:33:29.7751039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.7751379Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.7751671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.7752259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.7752851Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.7753538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.7754327Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.7754888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.7755603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.7756300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.7756860Z kernel = self.compile( 2025-05-07T20:33:29.7757431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.7758119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.7758530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.7758776Z 2025-05-07T20:33:29.7758994Z self = 2025-05-07T20:33:29.7760130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.7761565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4590ea0>} 2025-05-07T20:33:29.7762978Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.7764125Z context = 2025-05-07T20:33:29.7764430Z 2025-05-07T20:33:29.7764681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.7765229Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.7765726Z module_map=module_map) 2025-05-07T20:33:29.7766156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.7766536Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.7766802Z E ^ 2025-05-07T20:33:29.7767284Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.7767764Z 2025-05-07T20:33:29.7768215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.7768768Z 2025-05-07T20:33:29.7768878Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.7769313Z self=, 2025-05-07T20:33:29.7769743Z T=16384, 2025-05-07T20:33:29.7769955Z D=7168, 2025-05-07T20:33:29.7770154Z scale_ub=1200.0, 2025-05-07T20:33:29.7770389Z contiguous=True, 2025-05-07T20:33:29.7770619Z compiled=True, 2025-05-07T20:33:29.7770875Z ) 2025-05-07T20:33:29.7771222Z self = 2025-05-07T20:33:29.7771750Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:29.7772039Z 2025-05-07T20:33:29.7772119Z @given( 2025-05-07T20:33:29.7772356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.7772689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.7773016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.7773356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.7773700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.7774009Z ) 2025-05-07T20:33:29.7774375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.7774969Z def test_silu_mul_quant( 2025-05-07T20:33:29.7775224Z self, 2025-05-07T20:33:29.7775417Z T: int, 2025-05-07T20:33:29.7775619Z D: int, 2025-05-07T20:33:29.7775843Z scale_ub: Optional[float], 2025-05-07T20:33:29.7776129Z contiguous: bool, 2025-05-07T20:33:29.7776374Z compiled: bool, 2025-05-07T20:33:29.7776606Z ) -> None: 2025-05-07T20:33:29.7776825Z torch.manual_seed(2025) 2025-05-07T20:33:29.7777061Z 2025-05-07T20:33:29.7777347Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.7777706Z 2025-05-07T20:33:29.7777896Z x_sign = torch.sign(x) 2025-05-07T20:33:29.7778190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.7778509Z x = x_sign * x_clamp 2025-05-07T20:33:29.7778749Z x0 = x[:, :D] 2025-05-07T20:33:29.7778964Z x1 = x[:, D:] 2025-05-07T20:33:29.7779173Z 2025-05-07T20:33:29.7779354Z if contiguous: 2025-05-07T20:33:29.7779586Z x0 = x0.contiguous() 2025-05-07T20:33:29.7779848Z x1 = x1.contiguous() 2025-05-07T20:33:29.7780092Z 2025-05-07T20:33:29.7780280Z if scale_ub is not None: 2025-05-07T20:33:29.7780559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.7780898Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.7781206Z ) 2025-05-07T20:33:29.7781401Z else: 2025-05-07T20:33:29.7781615Z scale_ub_tensor = None 2025-05-07T20:33:29.7781867Z 2025-05-07T20:33:29.7782103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.7782479Z op = silu_mul_quant 2025-05-07T20:33:29.7782731Z if compiled: 2025-05-07T20:33:29.7782982Z op = torch.compile(op) 2025-05-07T20:33:29.7783282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.7783557Z 2025-05-07T20:33:29.7783750Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.7783969Z 2025-05-07T20:33:29.7784087Z moe/activation_test.py:117: 2025-05-07T20:33:29.7784416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.7784751Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.7785084Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.7785666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.7786251Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.7786948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.7787679Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.7788242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.7788954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.7789659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.7790223Z kernel = self.compile( 2025-05-07T20:33:29.7790838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.7791542Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.7791964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.7792213Z 2025-05-07T20:33:29.7792438Z self = 2025-05-07T20:33:29.7793571Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.7795004Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e45920c0>} 2025-05-07T20:33:29.7796426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.7797524Z context = 2025-05-07T20:33:29.7797828Z 2025-05-07T20:33:29.7798009Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.7798555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.7799051Z module_map=module_map) 2025-05-07T20:33:29.7799435Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.7799798Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.7800072Z E ^ 2025-05-07T20:33:29.7800563Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.7801038Z 2025-05-07T20:33:29.7801488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.7802039Z 2025-05-07T20:33:29.9036859Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.9037356Z self=, 2025-05-07T20:33:29.9037934Z T=16384, 2025-05-07T20:33:29.9038133Z D=5120, 2025-05-07T20:33:29.9038322Z scale_ub=1200.0, 2025-05-07T20:33:29.9038542Z contiguous=True, 2025-05-07T20:33:29.9038885Z compiled=False, 2025-05-07T20:33:29.9039080Z ) 2025-05-07T20:33:29.9039404Z self = 2025-05-07T20:33:29.9039923Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:29.9040211Z 2025-05-07T20:33:29.9040352Z @given( 2025-05-07T20:33:29.9040583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.9040900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.9041214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.9041605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.9041936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.9042232Z ) 2025-05-07T20:33:29.9042580Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.9043036Z def test_silu_mul_quant( 2025-05-07T20:33:29.9043285Z self, 2025-05-07T20:33:29.9043472Z T: int, 2025-05-07T20:33:29.9043668Z D: int, 2025-05-07T20:33:29.9043896Z scale_ub: Optional[float], 2025-05-07T20:33:29.9044174Z contiguous: bool, 2025-05-07T20:33:29.9044423Z compiled: bool, 2025-05-07T20:33:29.9044652Z ) -> None: 2025-05-07T20:33:29.9044862Z torch.manual_seed(2025) 2025-05-07T20:33:29.9045105Z 2025-05-07T20:33:29.9045379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.9045726Z 2025-05-07T20:33:29.9046020Z x_sign = torch.sign(x) 2025-05-07T20:33:29.9046310Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.9046625Z x = x_sign * x_clamp 2025-05-07T20:33:29.9046862Z x0 = x[:, :D] 2025-05-07T20:33:29.9047071Z x1 = x[:, D:] 2025-05-07T20:33:29.9047281Z 2025-05-07T20:33:29.9047468Z if contiguous: 2025-05-07T20:33:29.9047696Z x0 = x0.contiguous() 2025-05-07T20:33:29.9047962Z x1 = x1.contiguous() 2025-05-07T20:33:29.9048204Z 2025-05-07T20:33:29.9048397Z if scale_ub is not None: 2025-05-07T20:33:29.9048669Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.9049007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.9049320Z ) 2025-05-07T20:33:29.9049508Z else: 2025-05-07T20:33:29.9049716Z scale_ub_tensor = None 2025-05-07T20:33:29.9049968Z 2025-05-07T20:33:29.9050194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.9050517Z op = silu_mul_quant 2025-05-07T20:33:29.9050772Z if compiled: 2025-05-07T20:33:29.9051013Z op = torch.compile(op) 2025-05-07T20:33:29.9051313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.9051594Z 2025-05-07T20:33:29.9051781Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.9051952Z 2025-05-07T20:33:29.9052048Z moe/activation_test.py:117: 2025-05-07T20:33:29.9052346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.9052686Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.9052963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.9053691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:29.9054508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.9055070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.9055797Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.9056490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.9057047Z kernel = self.compile( 2025-05-07T20:33:29.9057605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.9058350Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.9058768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.9059006Z 2025-05-07T20:33:29.9059232Z self = 2025-05-07T20:33:29.9060395Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.9061867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4591a80>} 2025-05-07T20:33:29.9063276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.9064373Z context = 2025-05-07T20:33:29.9064672Z 2025-05-07T20:33:29.9064842Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.9065394Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.9065884Z module_map=module_map) 2025-05-07T20:33:29.9066312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.9066674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.9066952Z E ^ 2025-05-07T20:33:29.9067442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.9067914Z 2025-05-07T20:33:29.9068354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.9068907Z 2025-05-07T20:33:29.9069018Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.9069448Z self=, 2025-05-07T20:33:29.9069871Z T=1, 2025-05-07T20:33:29.9070063Z D=7168, 2025-05-07T20:33:29.9070254Z scale_ub=1200.0, 2025-05-07T20:33:29.9070481Z contiguous=False, 2025-05-07T20:33:29.9070701Z compiled=False, 2025-05-07T20:33:29.9070908Z ) 2025-05-07T20:33:29.9071235Z self = 2025-05-07T20:33:29.9071732Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:29.9072016Z 2025-05-07T20:33:29.9072090Z @given( 2025-05-07T20:33:29.9072316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.9072631Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.9072938Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.9073272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.9073608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.9073891Z ) 2025-05-07T20:33:29.9074244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.9074699Z def test_silu_mul_quant( 2025-05-07T20:33:29.9074938Z self, 2025-05-07T20:33:29.9075135Z T: int, 2025-05-07T20:33:29.9075331Z D: int, 2025-05-07T20:33:29.9075543Z scale_ub: Optional[float], 2025-05-07T20:33:29.9075823Z contiguous: bool, 2025-05-07T20:33:29.9076066Z compiled: bool, 2025-05-07T20:33:29.9076282Z ) -> None: 2025-05-07T20:33:29.9076496Z torch.manual_seed(2025) 2025-05-07T20:33:29.9076746Z 2025-05-07T20:33:29.9077016Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.9077376Z 2025-05-07T20:33:29.9077570Z x_sign = torch.sign(x) 2025-05-07T20:33:29.9077865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.9079757Z x = x_sign * x_clamp 2025-05-07T20:33:29.9079999Z x0 = x[:, :D] 2025-05-07T20:33:29.9080218Z x1 = x[:, D:] 2025-05-07T20:33:29.9080417Z 2025-05-07T20:33:29.9080606Z if contiguous: 2025-05-07T20:33:29.9080842Z x0 = x0.contiguous() 2025-05-07T20:33:29.9081140Z x1 = x1.contiguous() 2025-05-07T20:33:29.9081382Z 2025-05-07T20:33:29.9081575Z if scale_ub is not None: 2025-05-07T20:33:29.9081847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.9082193Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.9082549Z ) 2025-05-07T20:33:29.9090374Z else: 2025-05-07T20:33:29.9090611Z scale_ub_tensor = None 2025-05-07T20:33:29.9090884Z 2025-05-07T20:33:29.9091128Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.9091469Z op = silu_mul_quant 2025-05-07T20:33:29.9091733Z if compiled: 2025-05-07T20:33:29.9091993Z op = torch.compile(op) 2025-05-07T20:33:29.9092301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.9092590Z 2025-05-07T20:33:29.9092784Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.9092959Z 2025-05-07T20:33:29.9093063Z moe/activation_test.py:117: 2025-05-07T20:33:29.9093376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.9093722Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.9094096Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.9094902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.9095646Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.9096208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.9096931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.9097634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.9098200Z kernel = self.compile( 2025-05-07T20:33:29.9098761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.9099459Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.9099879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.9100117Z 2025-05-07T20:33:29.9100335Z self = 2025-05-07T20:33:29.9101470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.9102903Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e41ac0e0>} 2025-05-07T20:33:29.9104314Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.9105403Z context = 2025-05-07T20:33:29.9105704Z 2025-05-07T20:33:29.9105876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.9106416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.9106901Z module_map=module_map) 2025-05-07T20:33:29.9107279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.9107639Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.9107915Z E ^ 2025-05-07T20:33:29.9108456Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.9108931Z 2025-05-07T20:33:29.9109370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.9109924Z 2025-05-07T20:33:30.0841021Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.0841496Z self=, 2025-05-07T20:33:30.0841949Z T=4096, 2025-05-07T20:33:30.0842641Z D=7168, 2025-05-07T20:33:30.0843507Z scale_ub=1200.0, 2025-05-07T20:33:30.0844033Z contiguous=False, 2025-05-07T20:33:30.0844357Z compiled=True, 2025-05-07T20:33:30.0844578Z ) 2025-05-07T20:33:30.0844911Z self = 2025-05-07T20:33:30.0845443Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:30.0845734Z 2025-05-07T20:33:30.0845828Z @given( 2025-05-07T20:33:30.0846063Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.0846381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.0846695Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.0847039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.0847371Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.0847672Z ) 2025-05-07T20:33:30.0848105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.0848566Z def test_silu_mul_quant( 2025-05-07T20:33:30.0848812Z self, 2025-05-07T20:33:30.0849011Z T: int, 2025-05-07T20:33:30.0849202Z D: int, 2025-05-07T20:33:30.0849426Z scale_ub: Optional[float], 2025-05-07T20:33:30.0849702Z contiguous: bool, 2025-05-07T20:33:30.0849939Z compiled: bool, 2025-05-07T20:33:30.0850169Z ) -> None: 2025-05-07T20:33:30.0850395Z torch.manual_seed(2025) 2025-05-07T20:33:30.0850649Z 2025-05-07T20:33:30.0850924Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.0851278Z 2025-05-07T20:33:30.0851475Z x_sign = torch.sign(x) 2025-05-07T20:33:30.0851770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.0852093Z x = x_sign * x_clamp 2025-05-07T20:33:30.0852342Z x0 = x[:, :D] 2025-05-07T20:33:30.0852554Z x1 = x[:, D:] 2025-05-07T20:33:30.0852765Z 2025-05-07T20:33:30.0852966Z if contiguous: 2025-05-07T20:33:30.0853195Z x0 = x0.contiguous() 2025-05-07T20:33:30.0853461Z x1 = x1.contiguous() 2025-05-07T20:33:30.0853708Z 2025-05-07T20:33:30.0853895Z if scale_ub is not None: 2025-05-07T20:33:30.0854219Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.0854649Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.0854962Z ) 2025-05-07T20:33:30.0855163Z else: 2025-05-07T20:33:30.0855379Z scale_ub_tensor = None 2025-05-07T20:33:30.0855638Z 2025-05-07T20:33:30.0855876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.0856192Z op = silu_mul_quant 2025-05-07T20:33:30.0856451Z if compiled: 2025-05-07T20:33:30.0856698Z op = torch.compile(op) 2025-05-07T20:33:30.0857007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.0857293Z 2025-05-07T20:33:30.0857485Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.0857672Z 2025-05-07T20:33:30.0857776Z moe/activation_test.py:117: 2025-05-07T20:33:30.0858077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.0858426Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.0858709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.0859294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:30.0859959Z return fn(*args, **kwargs) 
2025-05-07T20:33:30.0860641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:30.0861369Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:30.0861978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.0862702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.0863394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.0863996Z kernel = self.compile( 2025-05-07T20:33:30.0864563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.0865250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.0865668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.0865913Z 2025-05-07T20:33:30.0866126Z self = 2025-05-07T20:33:30.0867262Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.0868747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e41ad300>} 2025-05-07T20:33:30.0870162Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.0871258Z context = 2025-05-07T20:33:30.0871568Z 2025-05-07T20:33:30.0871739Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.0872282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.0872763Z module_map=module_map) 2025-05-07T20:33:30.0873144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.0873511Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:30.0873778Z E ^ 2025-05-07T20:33:30.0874272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.0874751Z 2025-05-07T20:33:30.0875190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.0875734Z 2025-05-07T20:33:30.0875846Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.0876274Z self=, 2025-05-07T20:33:30.0876694Z T=128, 2025-05-07T20:33:30.0877065Z D=7168, 2025-05-07T20:33:30.0877263Z scale_ub=1200.0, 2025-05-07T20:33:30.0877497Z contiguous=False, 2025-05-07T20:33:30.0877731Z compiled=True, 2025-05-07T20:33:30.0877935Z ) 2025-05-07T20:33:30.1790054Z self = 2025-05-07T20:33:30.1790902Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:30.1791315Z 2025-05-07T20:33:30.1791442Z @given( 2025-05-07T20:33:30.1791774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.1792190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.1792508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.1792849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.1793179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.1793474Z ) 2025-05-07T20:33:30.1793948Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.1794402Z def test_silu_mul_quant( 2025-05-07T20:33:30.1794646Z self, 2025-05-07T20:33:30.1794847Z T: int, 2025-05-07T20:33:30.1795049Z D: int, 2025-05-07T20:33:30.1795336Z scale_ub: Optional[float], 2025-05-07T20:33:30.1795616Z contiguous: bool, 2025-05-07T20:33:30.1795857Z compiled: bool, 2025-05-07T20:33:30.1796088Z ) -> None: 2025-05-07T20:33:30.1796309Z torch.manual_seed(2025) 2025-05-07T20:33:30.1796547Z 2025-05-07T20:33:30.1796893Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.1797255Z 2025-05-07T20:33:30.1797458Z x_sign = torch.sign(x) 2025-05-07T20:33:30.1797753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.1798080Z x = x_sign * x_clamp 2025-05-07T20:33:30.1798326Z x0 = x[:, :D] 2025-05-07T20:33:30.1798543Z x1 = x[:, D:] 2025-05-07T20:33:30.1798766Z 2025-05-07T20:33:30.1798962Z if contiguous: 2025-05-07T20:33:30.1799198Z x0 = x0.contiguous() 2025-05-07T20:33:30.1799468Z x1 = x1.contiguous() 2025-05-07T20:33:30.1799717Z 2025-05-07T20:33:30.1799912Z if scale_ub is not None: 2025-05-07T20:33:30.1800202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.1800552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.1800870Z ) 2025-05-07T20:33:30.1801139Z else: 2025-05-07T20:33:30.1801363Z scale_ub_tensor = None 2025-05-07T20:33:30.1801620Z 2025-05-07T20:33:30.1801864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.1802191Z op = silu_mul_quant 2025-05-07T20:33:30.1802452Z if compiled: 2025-05-07T20:33:30.1802701Z op = torch.compile(op) 2025-05-07T20:33:30.1803017Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.1803306Z 2025-05-07T20:33:30.1803514Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.1803683Z 2025-05-07T20:33:30.1803786Z moe/activation_test.py:117: 2025-05-07T20:33:30.1804091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.1804438Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.1804727Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.1805317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:30.1805907Z return fn(*args, **kwargs) 
2025-05-07T20:33:30.1806603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:30.1807326Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:30.1807885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.1808603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.1809300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.1809861Z kernel = self.compile( 2025-05-07T20:33:30.1810429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.1811119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.1811529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.1811778Z 2025-05-07T20:33:30.1811992Z self = 2025-05-07T20:33:30.1813115Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.1814796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e41ae020>} 2025-05-07T20:33:30.1816262Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.1817356Z context = 2025-05-07T20:33:30.1817662Z 2025-05-07T20:33:30.1817877Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.1818430Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.1818919Z module_map=module_map) 2025-05-07T20:33:30.1819300Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.1819673Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:30.1819953Z E ^ 2025-05-07T20:33:30.1820440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.1820920Z 2025-05-07T20:33:30.1821370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.1821919Z 2025-05-07T20:33:30.1822028Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.1822571Z self=, 2025-05-07T20:33:30.1822994Z T=2048, 2025-05-07T20:33:30.1823194Z D=7168, 2025-05-07T20:33:30.1823394Z scale_ub=None, 2025-05-07T20:33:30.1823613Z contiguous=True, 2025-05-07T20:33:30.1823849Z compiled=True, 2025-05-07T20:33:30.1824067Z ) 2025-05-07T20:33:30.1824441Z self = 2025-05-07T20:33:30.1824956Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:30.1825237Z 2025-05-07T20:33:30.1825327Z @given( 2025-05-07T20:33:30.1825754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.1826077Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.1826400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.1826743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.1827083Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.1827386Z ) 2025-05-07T20:33:30.1827755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.1828218Z def test_silu_mul_quant( 2025-05-07T20:33:30.1828472Z self, 2025-05-07T20:33:30.1828677Z T: int, 2025-05-07T20:33:30.1828880Z D: int, 2025-05-07T20:33:30.1829103Z scale_ub: Optional[float], 2025-05-07T20:33:30.1829381Z contiguous: bool, 2025-05-07T20:33:30.1829620Z compiled: bool, 2025-05-07T20:33:30.1829853Z ) -> None: 2025-05-07T20:33:30.1830075Z torch.manual_seed(2025) 2025-05-07T20:33:30.1830320Z 2025-05-07T20:33:30.1830600Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.1830954Z 2025-05-07T20:33:30.1831153Z x_sign = torch.sign(x) 2025-05-07T20:33:30.1831446Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.1831762Z x = x_sign * x_clamp 2025-05-07T20:33:30.1832010Z x0 = x[:, :D] 2025-05-07T20:33:30.1832228Z x1 = x[:, D:] 2025-05-07T20:33:30.1832437Z 2025-05-07T20:33:30.1832625Z if contiguous: 2025-05-07T20:33:30.1832855Z x0 = x0.contiguous() 2025-05-07T20:33:30.1833118Z x1 = x1.contiguous() 2025-05-07T20:33:30.1833364Z 2025-05-07T20:33:30.1833555Z if scale_ub is not None: 2025-05-07T20:33:30.1833836Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.1834177Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.1834589Z ) 2025-05-07T20:33:30.1834785Z else: 2025-05-07T20:33:30.1835000Z scale_ub_tensor = None 2025-05-07T20:33:30.1835280Z 2025-05-07T20:33:30.1835507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.1835831Z op = silu_mul_quant 2025-05-07T20:33:30.1836148Z if compiled: 2025-05-07T20:33:30.1836395Z op = torch.compile(op) 2025-05-07T20:33:30.1836705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.1836989Z 2025-05-07T20:33:30.1837178Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.1837410Z 2025-05-07T20:33:30.1837510Z moe/activation_test.py:117: 2025-05-07T20:33:30.1837809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.1838152Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.1838438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.1839022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:30.1839613Z return fn(*args, **kwargs) 
2025-05-07T20:33:30.2501968Z Trying example: test_silu_mul_quant( self=<…>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False, )
[test source listing elided here and for all following examples; each entry keeps the example's parameters, the failing statement, and the error]
2025-05-07T20:33:30.2516723Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:30.2518880Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:30.2521017Z moe/activation_test.py:95: OutOfMemoryError
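Note: the allocator hint in the message is only part of the story. expandable_segments:True lets the caching allocator grow segments instead of fragmenting fixed-size ones, which addresses the reserved-but-unallocated slice (45-141 MiB here); the dominant problem, though, is that the device is already ~21.6 GiB full when the example starts. The setting is read once, at the first CUDA allocation, so it must be in place before torch touches the GPU. A sketch, assuming conftest.py or the CI job environment as the placement:

    import os

    # The allocator reads this once, at the first CUDA allocation, so set it
    # before torch allocates on the GPU (e.g. conftest.py or the job env).
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  -- imported after the env var on purpose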
2025-05-07T20:33:30.2521344Z Trying example: test_silu_mul_quant( self=<…>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True, )
2025-05-07T20:33:30.2530462Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:30.2532660Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.2534906Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:30.2535268Z Trying example: test_silu_mul_quant( self=<…>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False, )
2025-05-07T20:33:30.2550068Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:30.2552285Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.2554433Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:30.2554755Z Trying example: test_silu_mul_quant( self=<…>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True, )
2025-05-07T20:33:30.2563874Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:30.2566056Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.2568166Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:30.2568490Z Trying example: test_silu_mul_quant( self=<…>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False, )
2025-05-07T20:33:30.3703849Z >       x_sign = torch.sign(x)
2025-05-07T20:33:30.3705977Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.3708154Z moe/activation_test.py:94: OutOfMemoryError
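Note: the "allocated by PyTorch" figure only climbs across consecutive examples (21.50 GiB, then 21.67 GiB, later 21.73 GiB), even though each example allocates fresh tensors. Tensors from a failed example stay reachable until Hypothesis moves on, so each new example starts on a nearly full device and progressively earlier statements OOM: randn at activation_test.py:92, sign at :94, clamp at :95. A best-effort per-example cleanup sketch; the placement inside the test body is illustrative:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop Python references to dead tensors
        torch.cuda.empty_cache()  # return cached blocks to the driver
        torch.cuda.synchronize()

    # Hypothetical use at the end of each example's body:
    # try:
    #     ... example body ...
    # finally:
    #     release_cuda_memory()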
2025-05-07T20:33:30.3708490Z Trying example: test_silu_mul_quant( self=<…>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False, )
[CompilationError traceback identical to the first example: fp8e4nv unsupported on this architecture]
2025-05-07T20:33:30.3739366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:30.3740026Z Trying example: test_silu_mul_quant( self=<…>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False, )
[identical CompilationError traceback]
2025-05-07T20:33:30.4443630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:30.4444273Z Trying example: test_silu_mul_quant( self=<…>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False, )
[identical CompilationError traceback]
2025-05-07T20:33:30.4475186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
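Note: even the tiny shapes here (T=1, T=128) fail identically, confirming the CompilationError is an architecture problem, not a size problem. For context on what _fbgemm_silu_mul_quant is compiling: judging from the call sites, silu_mul_quant(x0, x1, scale_ub) fuses a SwiGLU-style gate with dynamic FP8 quantization and returns the quantized activation plus its scale. A plausible eager-mode sketch of that contract, assuming row-wise scaling; this illustrates the semantics implied by the test, not FBGEMM's kernel:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = float(torch.finfo(torch.float8_e4m3fn).max)  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SwiGLU-style gate, computed in fp32 for a stable quantization step.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)  # cap the dynamic range
        scale = amax / FP8_MAX                    # per-row dequant scale
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)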
2025-05-07T20:33:30.4475846Z Trying example: test_silu_mul_quant( self=<…>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False, )
2025-05-07T20:33:30.5294607Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:30.5296839Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.5298954Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:30.5299269Z Trying example: test_silu_mul_quant( self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False, )
[identical CompilationError traceback: fp8e4nv unsupported]
2025-05-07T20:33:30.5337661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:30.5338311Z Trying example: test_silu_mul_quant( self=<…>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False, )
2025-05-07T20:33:30.5346937Z >       x_sign = torch.sign(x)
2025-05-07T20:33:30.5350399Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.5352509Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:33:30.5352830Z Trying example: test_silu_mul_quant( self=<…>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False, )
2025-05-07T20:33:30.6108383Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:30.6110574Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.6112695Z moe/activation_test.py:92: OutOfMemoryError
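Note: every "Tried to allocate" figure in this run is exactly the footprint of one [T, 2*D] bfloat16 tensor, i.e. T * 2D * 2 bytes, which pins the failures to the test's own input buffers rather than to the kernel. A quick check of the sizes reported above:

    MIB = 1 << 20

    def activation_buffer_mib(T: int, D: int, itemsize: int = 2) -> float:
        # one [T, 2*D] bf16 tensor, as allocated by torch.randn in the test
        return T * 2 * D * itemsize / MIB

    assert activation_buffer_mib(16384, 5120) == 320.0  # the randn OOM above
    assert activation_buffer_mib(16384, 7168) == 448.0
    assert activation_buffer_mib(4096, 7168) == 112.0
    assert activation_buffer_mib(2048, 5120) == 40.0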
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6112570Z 2025-05-07T20:33:30.6112695Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6112912Z 2025-05-07T20:33:30.6113018Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6113436Z self=, 2025-05-07T20:33:30.6113848Z T=4096, 2025-05-07T20:33:30.6114036Z D=5120, 2025-05-07T20:33:30.6114229Z scale_ub=None, 2025-05-07T20:33:30.6114447Z contiguous=True, 2025-05-07T20:33:30.6114663Z compiled=False, 2025-05-07T20:33:30.6114867Z ) 2025-05-07T20:33:30.6115191Z self = 2025-05-07T20:33:30.6115693Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:30.6115980Z 2025-05-07T20:33:30.6116059Z @given( 2025-05-07T20:33:30.6116288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.6116602Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.6116903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.6117235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.6117578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.6117863Z ) 2025-05-07T20:33:30.6118209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.6118666Z def test_silu_mul_quant( 2025-05-07T20:33:30.6118907Z self, 2025-05-07T20:33:30.6119098Z T: int, 2025-05-07T20:33:30.6119289Z D: int, 2025-05-07T20:33:30.6119500Z scale_ub: Optional[float], 2025-05-07T20:33:30.6119777Z contiguous: bool, 2025-05-07T20:33:30.6120018Z compiled: bool, 2025-05-07T20:33:30.6120241Z ) -> None: 2025-05-07T20:33:30.6120444Z torch.manual_seed(2025) 2025-05-07T20:33:30.6120758Z 2025-05-07T20:33:30.6121034Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.6123238Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6125294Z 2025-05-07T20:33:30.6125578Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6125803Z 2025-05-07T20:33:30.6125907Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6126325Z self=, 2025-05-07T20:33:30.6126746Z T=2048, 2025-05-07T20:33:30.6126935Z D=5120, 2025-05-07T20:33:30.6127125Z scale_ub=None, 2025-05-07T20:33:30.6127332Z contiguous=False, 2025-05-07T20:33:30.6127549Z compiled=False, 2025-05-07T20:33:30.6127745Z ) 2025-05-07T20:33:30.6128068Z self = 2025-05-07T20:33:30.6128565Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:30.6128849Z 2025-05-07T20:33:30.6128995Z @given( 2025-05-07T20:33:30.6129223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.6129533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.6129842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.6130171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.6130504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.6130788Z ) 2025-05-07T20:33:30.6131133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.6131585Z def test_silu_mul_quant( 2025-05-07T20:33:30.6131820Z self, 2025-05-07T20:33:30.6132010Z T: int, 2025-05-07T20:33:30.6132207Z D: int, 2025-05-07T20:33:30.6132423Z scale_ub: Optional[float], 2025-05-07T20:33:30.6132699Z contiguous: bool, 2025-05-07T20:33:30.6132931Z compiled: bool, 2025-05-07T20:33:30.6133147Z ) -> None: 2025-05-07T20:33:30.6133355Z torch.manual_seed(2025) 2025-05-07T20:33:30.6133592Z 2025-05-07T20:33:30.6133856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.6136132Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6138117Z 2025-05-07T20:33:30.6138236Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6138453Z 2025-05-07T20:33:30.6138552Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6138969Z self=, 2025-05-07T20:33:30.6139382Z T=4096, 2025-05-07T20:33:30.6139585Z D=7168, 2025-05-07T20:33:30.6139769Z scale_ub=None, 2025-05-07T20:33:30.6139987Z contiguous=True, 2025-05-07T20:33:30.6140212Z compiled=True, 2025-05-07T20:33:30.6140411Z ) 2025-05-07T20:33:30.6140730Z self = 2025-05-07T20:33:30.6141235Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:30.6141580Z 2025-05-07T20:33:30.6141658Z @given( 2025-05-07T20:33:30.6141888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.6142205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.6142512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.6142916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.6143254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.6143541Z ) 2025-05-07T20:33:30.6143892Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.6144410Z def test_silu_mul_quant( 2025-05-07T20:33:30.6144659Z self, 2025-05-07T20:33:30.6144856Z T: int, 2025-05-07T20:33:30.6145055Z D: int, 2025-05-07T20:33:30.6145275Z scale_ub: Optional[float], 2025-05-07T20:33:30.6145550Z contiguous: bool, 2025-05-07T20:33:30.6145792Z compiled: bool, 2025-05-07T20:33:30.6146008Z ) -> None: 2025-05-07T20:33:30.6146219Z torch.manual_seed(2025) 2025-05-07T20:33:30.6146461Z 2025-05-07T20:33:30.6146729Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.6148956Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6150949Z 2025-05-07T20:33:30.6151068Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6151283Z 2025-05-07T20:33:30.6151387Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6151807Z self=, 2025-05-07T20:33:30.6152226Z T=2048, 2025-05-07T20:33:30.6152413Z D=5120, 2025-05-07T20:33:30.6152603Z scale_ub=1200.0, 2025-05-07T20:33:30.6152816Z contiguous=False, 2025-05-07T20:33:30.6153035Z compiled=False, 2025-05-07T20:33:30.6153237Z ) 2025-05-07T20:33:30.6153557Z self = 2025-05-07T20:33:30.6154067Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:30.6154348Z 2025-05-07T20:33:30.6154426Z @given( 2025-05-07T20:33:30.6154655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.6154976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.6155288Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.6155628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.6155964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.6156248Z ) 2025-05-07T20:33:30.6156597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.6157050Z def test_silu_mul_quant( 2025-05-07T20:33:30.6157290Z self, 2025-05-07T20:33:30.6157487Z T: int, 2025-05-07T20:33:30.6157696Z D: int, 2025-05-07T20:33:30.6157916Z scale_ub: Optional[float], 2025-05-07T20:33:30.6158187Z contiguous: bool, 2025-05-07T20:33:30.6158431Z compiled: bool, 2025-05-07T20:33:30.6158651Z ) -> None: 2025-05-07T20:33:30.6158856Z torch.manual_seed(2025) 2025-05-07T20:33:30.6159097Z 2025-05-07T20:33:30.6159368Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.6161536Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6164137Z 2025-05-07T20:33:30.6164257Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6164479Z 2025-05-07T20:33:30.6164582Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6165007Z self=, 2025-05-07T20:33:30.6165517Z T=4096, 2025-05-07T20:33:30.6165701Z D=7168, 2025-05-07T20:33:30.6165894Z scale_ub=1200.0, 2025-05-07T20:33:30.6166113Z contiguous=True, 2025-05-07T20:33:30.6166338Z compiled=False, 2025-05-07T20:33:30.6166546Z ) 2025-05-07T20:33:30.7249149Z self = 2025-05-07T20:33:30.7249898Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:30.7250312Z 2025-05-07T20:33:30.7250420Z @given( 2025-05-07T20:33:30.7250772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.7251108Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.7251417Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.7251755Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.7252194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.7252484Z ) 2025-05-07T20:33:30.7252830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.7253282Z def test_silu_mul_quant( 2025-05-07T20:33:30.7253526Z self, 2025-05-07T20:33:30.7253713Z T: int, 2025-05-07T20:33:30.7253906Z D: int, 2025-05-07T20:33:30.7254123Z scale_ub: Optional[float], 2025-05-07T20:33:30.7254392Z contiguous: bool, 2025-05-07T20:33:30.7254731Z compiled: bool, 2025-05-07T20:33:30.7254951Z ) -> None: 2025-05-07T20:33:30.7255159Z torch.manual_seed(2025) 2025-05-07T20:33:30.7255403Z 2025-05-07T20:33:30.7255675Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.7257867Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.7317833Z 2025-05-07T20:33:30.7317960Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.7318181Z 2025-05-07T20:33:30.7318288Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.7318721Z self=, 2025-05-07T20:33:30.7319152Z T=128, 2025-05-07T20:33:30.7319344Z D=5120, 2025-05-07T20:33:30.7319537Z scale_ub=1200.0, 2025-05-07T20:33:30.7319763Z contiguous=False, 2025-05-07T20:33:30.7320026Z compiled=False, 2025-05-07T20:33:30.7320232Z ) 2025-05-07T20:33:30.8618652Z self = 2025-05-07T20:33:30.8619414Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:30.8619828Z 2025-05-07T20:33:30.8619930Z @given( 2025-05-07T20:33:30.8620161Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.8620480Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.8620794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.8621135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.8621471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.8621765Z ) 2025-05-07T20:33:30.8622118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.8622580Z def test_silu_mul_quant( 2025-05-07T20:33:30.8622827Z self, 2025-05-07T20:33:30.8623028Z T: int, 2025-05-07T20:33:30.8623238Z D: int, 2025-05-07T20:33:30.8623469Z scale_ub: Optional[float], 2025-05-07T20:33:30.8623784Z contiguous: bool, 2025-05-07T20:33:30.8624034Z compiled: bool, 2025-05-07T20:33:30.8624282Z ) -> None: 2025-05-07T20:33:30.8624534Z torch.manual_seed(2025) 2025-05-07T20:33:30.8624777Z 2025-05-07T20:33:30.8625052Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.8625582Z 2025-05-07T20:33:30.8625781Z x_sign = torch.sign(x) 2025-05-07T20:33:30.8626076Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.8626395Z x = x_sign * x_clamp 2025-05-07T20:33:30.8626631Z x0 = x[:, :D] 2025-05-07T20:33:30.8626848Z x1 = x[:, D:] 2025-05-07T20:33:30.8627059Z 2025-05-07T20:33:30.8627243Z if contiguous: 2025-05-07T20:33:30.8627483Z x0 = x0.contiguous() 2025-05-07T20:33:30.8627748Z x1 = x1.contiguous() 2025-05-07T20:33:30.8627993Z 2025-05-07T20:33:30.8628179Z if scale_ub is not None: 2025-05-07T20:33:30.8628466Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.8628804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.8629116Z ) 2025-05-07T20:33:30.8629309Z else: 2025-05-07T20:33:30.8629521Z scale_ub_tensor = None 2025-05-07T20:33:30.8629773Z 2025-05-07T20:33:30.8630006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.8630444Z op = silu_mul_quant 2025-05-07T20:33:30.8630696Z if compiled: 2025-05-07T20:33:30.8630942Z op = torch.compile(op) 2025-05-07T20:33:30.8631246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.8631526Z 2025-05-07T20:33:30.8631779Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.8631946Z 2025-05-07T20:33:30.8632051Z moe/activation_test.py:117: 2025-05-07T20:33:30.8632356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.8632696Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.8633042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.8633767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:30.8634489Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:30.8635050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.8635775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.8636465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.8637022Z kernel = self.compile( 2025-05-07T20:33:30.8637590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.8638339Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.8638747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.8638989Z 2025-05-07T20:33:30.8639203Z self = 2025-05-07T20:33:30.8640330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.8641758Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0527f147c0>} 2025-05-07T20:33:30.8643166Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.8644246Z context = 2025-05-07T20:33:30.8644547Z 2025-05-07T20:33:30.8644716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.8645255Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.8645730Z module_map=module_map) 2025-05-07T20:33:30.8646086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.8646448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:30.8646711Z E ^ 2025-05-07T20:33:30.8647182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.8647659Z 2025-05-07T20:33:30.8648098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.8648644Z 2025-05-07T20:33:30.8648746Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.8649167Z self=, 2025-05-07T20:33:30.8649574Z T=2048, 2025-05-07T20:33:30.8649759Z D=7168, 2025-05-07T20:33:30.8649954Z scale_ub=None, 2025-05-07T20:33:30.8650162Z contiguous=False, 2025-05-07T20:33:30.8650388Z compiled=False, 2025-05-07T20:33:30.8650593Z ) 2025-05-07T20:33:30.8650907Z self = 2025-05-07T20:33:30.8651466Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:30.8651752Z 2025-05-07T20:33:30.8651842Z @given( 2025-05-07T20:33:30.8652070Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.8652381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.8652729Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.8653065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.8653389Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.8653713Z ) 2025-05-07T20:33:30.8654067Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.8654594Z def test_silu_mul_quant( 2025-05-07T20:33:30.8654836Z self, 2025-05-07T20:33:30.8655031Z T: int, 2025-05-07T20:33:30.8655220Z D: int, 2025-05-07T20:33:30.8655430Z scale_ub: Optional[float], 2025-05-07T20:33:30.8655701Z contiguous: bool, 2025-05-07T20:33:30.8655935Z compiled: bool, 2025-05-07T20:33:30.8656155Z ) -> None: 2025-05-07T20:33:30.8656360Z torch.manual_seed(2025) 2025-05-07T20:33:30.8656594Z 2025-05-07T20:33:30.8656865Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.8659102Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
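The CompilationError above is an architecture limit rather than a flake: Triton's fp8e4nv type corresponds to FP8 E4M3, which its NVIDIA backend only generates for compute capability 8.9 and newer, while the A10G in a g5.4xlarge reports (8, 6). A hedged sketch of a skip guard (the helper and class names are illustrative, not from activation_test.py):

import unittest
import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (FP8 E4M3) codegen needs SM 8.9+ (Ada/Hopper);
    # the A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9 or newer")
class Fp8ActivationTests(unittest.TestCase):  # hypothetical test class
    ...

Gating the fp8 cases this way would turn the CompilationError sub-failures below into explicit skips on this runner type.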
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.8661103Z 2025-05-07T20:33:30.8661222Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.8661439Z 2025-05-07T20:33:30.8661550Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.8661973Z self=, 2025-05-07T20:33:30.8662395Z T=128, 2025-05-07T20:33:30.8662588Z D=7168, 2025-05-07T20:33:30.8662783Z scale_ub=1200.0, 2025-05-07T20:33:30.8663007Z contiguous=True, 2025-05-07T20:33:30.8663234Z compiled=True, 2025-05-07T20:33:30.8663441Z ) 2025-05-07T20:33:30.8974642Z self = 2025-05-07T20:33:30.8976228Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:30.8977011Z 2025-05-07T20:33:30.8977225Z @given( 2025-05-07T20:33:30.8977825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.8978455Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.8979059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.8979728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.8980397Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.8980965Z ) 2025-05-07T20:33:30.8981659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.8982559Z def test_silu_mul_quant( 2025-05-07T20:33:30.8983047Z self, 2025-05-07T20:33:30.8983429Z T: int, 2025-05-07T20:33:30.8983823Z D: int, 2025-05-07T20:33:30.8984214Z scale_ub: Optional[float], 2025-05-07T20:33:30.8984539Z contiguous: bool, 2025-05-07T20:33:30.8984800Z compiled: bool, 2025-05-07T20:33:30.8985024Z ) -> None: 2025-05-07T20:33:30.8985235Z torch.manual_seed(2025) 2025-05-07T20:33:30.8985481Z 2025-05-07T20:33:30.8985749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.8986099Z 2025-05-07T20:33:30.8986288Z x_sign = torch.sign(x) 2025-05-07T20:33:30.8986576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.8986995Z x = x_sign * x_clamp 2025-05-07T20:33:30.8987234Z x0 = x[:, :D] 2025-05-07T20:33:30.8987446Z x1 = x[:, D:] 2025-05-07T20:33:30.8987653Z 2025-05-07T20:33:30.8987839Z if contiguous: 2025-05-07T20:33:30.8988069Z x0 = x0.contiguous() 2025-05-07T20:33:30.8988416Z x1 = x1.contiguous() 2025-05-07T20:33:30.8988666Z 2025-05-07T20:33:30.8988857Z if scale_ub is not None: 2025-05-07T20:33:30.8989131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.8989468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.8989863Z ) 2025-05-07T20:33:30.8990057Z else: 2025-05-07T20:33:30.8990272Z scale_ub_tensor = None 2025-05-07T20:33:30.8990525Z 2025-05-07T20:33:30.8990750Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.8991072Z op = silu_mul_quant 2025-05-07T20:33:30.8991324Z if compiled: 2025-05-07T20:33:30.8991567Z op = torch.compile(op) 2025-05-07T20:33:30.8991868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.8992146Z 2025-05-07T20:33:30.8992332Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.8992500Z 2025-05-07T20:33:30.8992600Z moe/activation_test.py:117: 2025-05-07T20:33:30.8992905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.8993247Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.8993592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.8994176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:30.8994761Z return fn(*args, **kwargs) 
2025-05-07T20:33:30.8995440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:30.8996166Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:30.8996727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.8997444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.8998132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.8998683Z kernel = self.compile( 2025-05-07T20:33:30.8999246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.8999928Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.9000336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.9000579Z 2025-05-07T20:33:30.9000786Z self = 2025-05-07T20:33:30.9001907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.9003340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0527f15940>} 2025-05-07T20:33:30.9004746Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.9005828Z context = 2025-05-07T20:33:30.9006132Z 2025-05-07T20:33:30.9006302Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.9006838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.9007313Z module_map=module_map) 2025-05-07T20:33:30.9007735Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.9008101Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:30.9008367Z E ^ 2025-05-07T20:33:30.9008891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.9009370Z 2025-05-07T20:33:30.9009808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.9010347Z 2025-05-07T20:33:30.9010457Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.9010913Z self=, 2025-05-07T20:33:30.9011330Z T=128, 2025-05-07T20:33:30.9011516Z D=7168, 2025-05-07T20:33:30.9011710Z scale_ub=1200.0, 2025-05-07T20:33:30.9011931Z contiguous=True, 2025-05-07T20:33:30.9012146Z compiled=False, 2025-05-07T20:33:30.9012346Z ) 2025-05-07T20:33:30.9012665Z self = 2025-05-07T20:33:30.9013170Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:30.9013448Z 2025-05-07T20:33:30.9013529Z @given( 2025-05-07T20:33:30.9013748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.9014066Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.9014377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.9014860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.9015193Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.9015492Z ) 2025-05-07T20:33:30.9015839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.9016290Z def test_silu_mul_quant( 2025-05-07T20:33:30.9016531Z self, 2025-05-07T20:33:30.9016726Z T: int, 2025-05-07T20:33:30.9016920Z D: int, 2025-05-07T20:33:30.9017133Z scale_ub: Optional[float], 2025-05-07T20:33:30.9017402Z contiguous: bool, 2025-05-07T20:33:30.9017635Z compiled: bool, 2025-05-07T20:33:30.9017849Z ) -> None: 2025-05-07T20:33:30.9018060Z torch.manual_seed(2025) 2025-05-07T20:33:30.9018297Z 2025-05-07T20:33:30.9018572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.9018920Z 2025-05-07T20:33:30.9019106Z x_sign = torch.sign(x) 2025-05-07T20:33:30.9019400Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.9021532Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
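Every OOM message in this log suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The setting must be in the environment before CUDA is initialized; one way to apply it at a test entry point, as a sketch only:

import os

# Must be set before torch initializes CUDA for the option to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (imported deliberately after the env var is set)

That said, the statistics above show 21.7+ GiB already allocated by PyTorch with only a few MiB free, so fragmentation is unlikely to be the root cause here; memory is simply not being released between examples.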
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
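Note how the PyTorch-allocated figure creeps from 21.73 GiB to 21.77 GiB across trials, which points at tensors surviving from one Hypothesis example to the next. A sketch of an explicit cleanup that could run between examples (hypothetical helper, not present in activation_test.py):

import gc
import torch

def _release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver
    torch.cuda.synchronize()  # make sure pending frees have completed

Called from the test class's setUp or tearDown, this would keep one trial's [16384, 14336] bf16 allocations from starving the next trial on the 22 GiB device.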
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.1715323Z 2025-05-07T20:33:31.1715444Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.1715670Z 2025-05-07T20:33:31.1733150Z FAILED 2025-05-07T20:33:31.1733347Z 2025-05-07T20:33:31.1733713Z =================================== FAILURES =================================== 2025-05-07T20:33:31.1734364Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:31.1735126Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:31.1735983Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:31.1736775Z | yield 2025-05-07T20:33:31.1737383Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:31.1738110Z | self._callTestMethod(testMethod) 2025-05-07T20:33:31.1738907Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:31.1739721Z | if method() is not None: 2025-05-07T20:33:31.1740066Z | ^^^^^^^^ 2025-05-07T20:33:31.1741095Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:31.1742159Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1753064Z | ^^^^^^^ 2025-05-07T20:33:31.1753841Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:31.1754803Z | raise the_error_hypothesis_found 2025-05-07T20:33:31.1755420Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:31.1756032Z +-+---------------- 1 ---------------- 2025-05-07T20:33:31.1756453Z | Traceback (most recent call last): 2025-05-07T20:33:31.1757502Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:31.1758632Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1759426Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1762287Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.1765164Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:31.1765799Z | self=, 2025-05-07T20:33:31.1766414Z | T=2048, 2025-05-07T20:33:31.1766743Z | D=5120, # or any other generated value 2025-05-07T20:33:31.1767242Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:31.1767807Z | contiguous=True, # or any other generated value 2025-05-07T20:33:31.1768322Z | compiled=False, # or any other generated value 2025-05-07T20:33:31.1768746Z | ) 2025-05-07T20:33:31.1769008Z | 2025-05-07T20:33:31.1769769Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:31.1770774Z +---------------- 2 ---------------- 2025-05-07T20:33:31.1771178Z | Traceback (most recent call last): 2025-05-07T20:33:31.1772292Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:31.1773458Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1773987Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1777057Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.1779172Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:31.1779628Z | self=, 2025-05-07T20:33:31.1780053Z | T=128, 2025-05-07T20:33:31.1780253Z | D=7168, 2025-05-07T20:33:31.1780466Z | scale_ub=None, 2025-05-07T20:33:31.1780761Z | contiguous=True, 2025-05-07T20:33:31.1781003Z | compiled=True, 2025-05-07T20:33:31.1781234Z | ) 2025-05-07T20:33:31.1781416Z | 2025-05-07T20:33:31.1781950Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:31.1782579Z +---------------- 3 ---------------- 2025-05-07T20:33:31.1782877Z | Traceback (most recent call last): 2025-05-07T20:33:31.1783618Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:31.1784424Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1784812Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1786923Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
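Each falsifying example above comes with a replay blob, and Hypothesis's reproduce_failure decorator re-runs exactly that draw. A sketch for the first failure, using the version string and blob printed in this log; the test's existing @given and @settings decorators stay in place underneath:

from hypothesis import reproduce_failure

@reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob from sub-exception 1 above
# ... the test's existing @given(...) and @settings(...) decorators go here,
# unchanged ...
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
    ...

The version in the first argument must match the installed Hypothesis, which the log's own suggestion ('6.131.14') already reflects.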
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.1789020Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:31.1789472Z | self=, 2025-05-07T20:33:31.1789893Z | T=128, 2025-05-07T20:33:31.1790100Z | D=5120, 2025-05-07T20:33:31.1790317Z | scale_ub=1200.0, 2025-05-07T20:33:31.1790557Z | contiguous=True, 2025-05-07T20:33:31.1790805Z | compiled=True, 2025-05-07T20:33:31.1791035Z | ) 2025-05-07T20:33:31.1791212Z | 2025-05-07T20:33:31.1791757Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:31.1792392Z +---------------- 4 ---------------- 2025-05-07T20:33:31.1792725Z | Traceback (most recent call last): 2025-05-07T20:33:31.1793722Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:31.1794838Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:31.1795325Z | ^^^^^^^^ 2025-05-07T20:33:31.1796247Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:31.1797351Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1797868Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1799081Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:31.1800138Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.1800792Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:31.1801558Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.1802016Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1802696Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:31.1803517Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.1804166Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1805145Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:31.1806178Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.1806732Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1807603Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:31.1808418Z | fn() 2025-05-07T20:33:31.1809245Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:31.1810171Z | self.fn.run( 2025-05-07T20:33:31.1810927Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:31.1811781Z | kernel = self.compile( 2025-05-07T20:33:31.1812169Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:31.1813048Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:31.1814073Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.1814776Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1815720Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:31.1816868Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.1817582Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1818135Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.1818648Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.1819030Z | ^ 2025-05-07T20:33:31.1819696Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.1820531Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:31.1821102Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:31.1821828Z | self=, 2025-05-07T20:33:31.1822462Z | T=1, # or any other generated value 2025-05-07T20:33:31.1822986Z | D=5120, # or any other generated value 2025-05-07T20:33:31.1823465Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:31.1823986Z | contiguous=True, # or any other generated value 2025-05-07T20:33:31.1824563Z | compiled=True, # or any other generated value 2025-05-07T20:33:31.1825053Z | ) 2025-05-07T20:33:31.1825326Z | 2025-05-07T20:33:31.1826396Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:31.1827440Z +------------------------------------ 2025-05-07T20:33:31.1827942Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:31.1828479Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.1829061Z self=, 2025-05-07T20:33:31.1829645Z T=1, 2025-05-07T20:33:31.1829923Z D=5120, 2025-05-07T20:33:31.1830212Z scale_ub=None, 2025-05-07T20:33:31.1830511Z contiguous=True, 2025-05-07T20:33:31.1830826Z compiled=True, 2025-05-07T20:33:31.1832045Z ) 2025-05-07T20:33:31.1832501Z self = 2025-05-07T20:33:31.1833200Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.1833577Z 2025-05-07T20:33:31.1833697Z @given( 2025-05-07T20:33:31.1834113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1834541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.1834981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.1835452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.1835917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.1836347Z ) 2025-05-07T20:33:31.1836853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.1837507Z def test_silu_mul_quant( 2025-05-07T20:33:31.1837854Z self, 2025-05-07T20:33:31.1838134Z T: int, 2025-05-07T20:33:31.1838403Z D: int, 2025-05-07T20:33:31.1838717Z scale_ub: Optional[float], 2025-05-07T20:33:31.1839093Z contiguous: bool, 2025-05-07T20:33:31.1839426Z compiled: bool, 2025-05-07T20:33:31.1839758Z ) -> None: 2025-05-07T20:33:31.1840054Z torch.manual_seed(2025) 2025-05-07T20:33:31.1840378Z 2025-05-07T20:33:31.1840755Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1841231Z 2025-05-07T20:33:31.1841489Z x_sign = torch.sign(x) 2025-05-07T20:33:31.1841866Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.1842271Z x = x_sign * x_clamp 2025-05-07T20:33:31.1842584Z x0 = x[:, :D] 2025-05-07T20:33:31.1842871Z x1 = x[:, D:] 2025-05-07T20:33:31.1843161Z 2025-05-07T20:33:31.1843416Z if contiguous: 2025-05-07T20:33:31.1843733Z x0 = x0.contiguous() 2025-05-07T20:33:31.1844093Z x1 = x1.contiguous() 2025-05-07T20:33:31.1844433Z 2025-05-07T20:33:31.1844703Z if scale_ub is not None: 2025-05-07T20:33:31.1845086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.1845540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.1845954Z ) 2025-05-07T20:33:31.1846211Z else: 2025-05-07T20:33:31.1846490Z scale_ub_tensor = None 2025-05-07T20:33:31.1846823Z 2025-05-07T20:33:31.1847132Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1847542Z op = silu_mul_quant 2025-05-07T20:33:31.1847899Z if compiled: 2025-05-07T20:33:31.1848238Z op = torch.compile(op) 2025-05-07T20:33:31.1848663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.1849063Z 2025-05-07T20:33:31.1849324Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.1849728Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.1850266Z 2025-05-07T20:33:31.1850590Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1851069Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.1851488Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.1851981Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.1852467Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1852879Z 2025-05-07T20:33:31.1853145Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:31.1853534Z 2025-05-07T20:33:31.1853667Z moe/activation_test.py:126: 2025-05-07T20:33:31.1854059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1854636Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.1855140Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1856232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.1857264Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.1857994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.1858914Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.1859894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.1860892Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.1861894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.1862834Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.1863681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.1864416Z fn() 2025-05-07T20:33:31.1865114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.1865938Z self.fn.run( 2025-05-07T20:33:31.1866579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.1867311Z kernel = self.compile( 2025-05-07T20:33:31.1868053Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.1868959Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.1869504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1869820Z 2025-05-07T20:33:31.1870096Z self = 2025-05-07T20:33:31.1871573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.1873520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a057dc60>} 2025-05-07T20:33:31.1875508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.1877034Z context = 2025-05-07T20:33:31.1877455Z 2025-05-07T20:33:31.1877696Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.1878424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.1879163Z module_map=module_map) 2025-05-07T20:33:31.1879664Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.1880193Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.1880560Z E ^ 2025-05-07T20:33:31.1881252Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.1881870Z 2025-05-07T20:33:31.1882431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.1883174Z 2025-05-07T20:33:31.1883312Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.1883852Z self=, 2025-05-07T20:33:31.1884372Z T=2048, 2025-05-07T20:33:31.1884618Z D=5120, 2025-05-07T20:33:31.1884881Z scale_ub=1200.0, 2025-05-07T20:33:31.1885211Z contiguous=True, 2025-05-07T20:33:31.1885516Z compiled=False, 2025-05-07T20:33:31.1885788Z ) 2025-05-07T20:33:31.1886212Z self = 2025-05-07T20:33:31.1886865Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.1887239Z 2025-05-07T20:33:31.1887346Z @given( 2025-05-07T20:33:31.1887672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1888078Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.1888536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.1888994Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.1889448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.1889839Z ) 2025-05-07T20:33:31.1890338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.1890997Z def test_silu_mul_quant( 2025-05-07T20:33:31.1891324Z self, 2025-05-07T20:33:31.1891583Z T: int, 2025-05-07T20:33:31.1891850Z D: int, 2025-05-07T20:33:31.1892139Z scale_ub: Optional[float], 2025-05-07T20:33:31.1892509Z contiguous: bool, 2025-05-07T20:33:31.1892831Z compiled: bool, 2025-05-07T20:33:31.1893123Z ) -> None: 2025-05-07T20:33:31.1893408Z torch.manual_seed(2025) 2025-05-07T20:33:31.1893736Z 2025-05-07T20:33:31.1894089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1894665Z 2025-05-07T20:33:31.1894931Z x_sign = torch.sign(x) 2025-05-07T20:33:31.1895314Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.1895729Z x = x_sign * x_clamp 2025-05-07T20:33:31.1896047Z x0 = x[:, :D] 
2025-05-07T20:33:31.1896332Z x1 = x[:, D:] 2025-05-07T20:33:31.1896615Z 2025-05-07T20:33:31.1896871Z if contiguous: 2025-05-07T20:33:31.1897188Z x0 = x0.contiguous() 2025-05-07T20:33:31.1897562Z x1 = x1.contiguous() 2025-05-07T20:33:31.1897912Z 2025-05-07T20:33:31.1898195Z if scale_ub is not None: 2025-05-07T20:33:31.1898566Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.1899000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.1899428Z ) 2025-05-07T20:33:31.1899678Z else: 2025-05-07T20:33:31.1899957Z scale_ub_tensor = None 2025-05-07T20:33:31.1900317Z 2025-05-07T20:33:31.1900625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1901068Z op = silu_mul_quant 2025-05-07T20:33:31.1901421Z if compiled: 2025-05-07T20:33:31.1901767Z op = torch.compile(op) 2025-05-07T20:33:31.1902183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.1902582Z 2025-05-07T20:33:31.1902831Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.1903058Z 2025-05-07T20:33:31.1903189Z moe/activation_test.py:117: 2025-05-07T20:33:31.1903587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1904116Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.1904494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.1905462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.1906512Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.1907303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.1908321Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.1909361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.1910156Z kernel = self.compile( 2025-05-07T20:33:31.1910895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.1911805Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.1912379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1912721Z 2025-05-07T20:33:31.1913025Z self = 2025-05-07T20:33:31.1914661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.1916665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03d4220>} 2025-05-07T20:33:31.1918646Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.1920137Z context = 2025-05-07T20:33:31.1920548Z 2025-05-07T20:33:31.1920777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.1921523Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.1922213Z module_map=module_map) 2025-05-07T20:33:31.1922726Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.1923216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.1923575Z E ^ 2025-05-07T20:33:31.1924239Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.1924903Z 2025-05-07T20:33:31.1925720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.1926494Z 2025-05-07T20:33:31.1926643Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.1927240Z self=, 2025-05-07T20:33:31.1927827Z T=2048, 2025-05-07T20:33:31.1928086Z D=5120, 2025-05-07T20:33:31.1928350Z scale_ub=1200.0, 2025-05-07T20:33:31.1928664Z contiguous=True, 2025-05-07T20:33:31.1928970Z compiled=True, 2025-05-07T20:33:31.1929257Z ) 2025-05-07T20:33:31.1929713Z self = 2025-05-07T20:33:31.1930414Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.1930804Z 2025-05-07T20:33:31.1930915Z @given( 2025-05-07T20:33:31.1931242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1931680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.1932110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.1932577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.1933036Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.1933641Z ) 2025-05-07T20:33:31.1934131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.1934880Z def test_silu_mul_quant( 2025-05-07T20:33:31.1935216Z self, 2025-05-07T20:33:31.1935480Z T: int, 2025-05-07T20:33:31.1935868Z D: int, 2025-05-07T20:33:31.1936174Z scale_ub: Optional[float], 2025-05-07T20:33:31.1936575Z contiguous: bool, 2025-05-07T20:33:31.1936926Z compiled: bool, 2025-05-07T20:33:31.1937248Z ) -> None: 2025-05-07T20:33:31.1937557Z torch.manual_seed(2025) 2025-05-07T20:33:31.1937989Z 2025-05-07T20:33:31.1938382Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1938868Z 2025-05-07T20:33:31.1939140Z x_sign = torch.sign(x) 2025-05-07T20:33:31.1939547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.1939976Z x = x_sign * x_clamp 2025-05-07T20:33:31.1940303Z x0 = x[:, :D] 2025-05-07T20:33:31.1940598Z x1 = x[:, D:] 2025-05-07T20:33:31.1940868Z 2025-05-07T20:33:31.1941114Z if contiguous: 2025-05-07T20:33:31.1941446Z x0 = x0.contiguous() 2025-05-07T20:33:31.1941777Z x1 = x1.contiguous() 2025-05-07T20:33:31.1942087Z 2025-05-07T20:33:31.1942355Z if scale_ub is not None: 2025-05-07T20:33:31.1942695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.1943209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.1943656Z ) 2025-05-07T20:33:31.1943928Z else: 2025-05-07T20:33:31.1944233Z scale_ub_tensor = None 2025-05-07T20:33:31.1944640Z 2025-05-07T20:33:31.1944953Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1945392Z op = silu_mul_quant 2025-05-07T20:33:31.1945882Z if compiled: 2025-05-07T20:33:31.1946342Z op = torch.compile(op) 2025-05-07T20:33:31.1946820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.1947623Z 2025-05-07T20:33:31.1947961Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.1966969Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.1967276Z 2025-05-07T20:33:31.1967531Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1967875Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.1968170Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.1968490Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.1968856Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1969166Z 2025-05-07T20:33:31.1969365Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:31.1969565Z 2025-05-07T20:33:31.1969668Z moe/activation_test.py:126: 2025-05-07T20:33:31.1969964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1970306Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.1970638Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1971459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.1972247Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.1972812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.1973543Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.1974276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.1975150Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.1975933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.1976710Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.1977347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.1977903Z fn() 2025-05-07T20:33:31.1979716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.1980342Z self.fn.run( 2025-05-07T20:33:31.1980841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.1981464Z kernel = self.compile( 2025-05-07T20:33:31.1982042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.1982738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.1983155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1983412Z 2025-05-07T20:33:31.1983626Z self = 2025-05-07T20:33:31.1984813Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.1986302Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03d56c0>} 2025-05-07T20:33:31.1987732Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.1988833Z context = 2025-05-07T20:33:31.1989139Z 2025-05-07T20:33:31.1989316Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.1989861Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.1990341Z module_map=module_map) 2025-05-07T20:33:31.1990718Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.1991088Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.1991363Z E ^ 2025-05-07T20:33:31.1991856Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.1992336Z 2025-05-07T20:33:31.1992777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.1993322Z 2025-05-07T20:33:31.1993429Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.1993849Z self=, 2025-05-07T20:33:31.1994269Z T=16384, 2025-05-07T20:33:31.1994522Z D=7168, 2025-05-07T20:33:31.1994738Z scale_ub=1200.0, 2025-05-07T20:33:31.1994976Z contiguous=False, 2025-05-07T20:33:31.1995212Z compiled=False, 2025-05-07T20:33:31.1995427Z ) 2025-05-07T20:33:31.1995769Z self = 2025-05-07T20:33:31.1996312Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.1996609Z 2025-05-07T20:33:31.1996699Z @given( 2025-05-07T20:33:31.1996942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1997274Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.1997594Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.1997941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.1998287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.1998600Z ) 2025-05-07T20:33:31.1998967Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.1999496Z def test_silu_mul_quant( 2025-05-07T20:33:31.1999749Z self, 2025-05-07T20:33:31.1999951Z T: int, 2025-05-07T20:33:31.2000149Z D: int, 2025-05-07T20:33:31.2000375Z scale_ub: Optional[float], 2025-05-07T20:33:31.2000665Z contiguous: bool, 2025-05-07T20:33:31.2000962Z compiled: bool, 2025-05-07T20:33:31.2001194Z ) -> None: 2025-05-07T20:33:31.2001418Z torch.manual_seed(2025) 2025-05-07T20:33:31.2001657Z 2025-05-07T20:33:31.2001935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2002327Z 2025-05-07T20:33:31.2002518Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2002814Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2003129Z x = x_sign * x_clamp 2025-05-07T20:33:31.2003371Z x0 = x[:, :D] 2025-05-07T20:33:31.2003597Z x1 = x[:, D:] 2025-05-07T20:33:31.2003800Z 2025-05-07T20:33:31.2003979Z if contiguous: 2025-05-07T20:33:31.2004212Z x0 = x0.contiguous() 2025-05-07T20:33:31.2004490Z x1 = x1.contiguous() 2025-05-07T20:33:31.2004744Z 2025-05-07T20:33:31.2004984Z if scale_ub is not None: 2025-05-07T20:33:31.2005297Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2005647Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2005987Z ) 2025-05-07T20:33:31.2006202Z else: 2025-05-07T20:33:31.2006498Z scale_ub_tensor = None 2025-05-07T20:33:31.2006776Z 2025-05-07T20:33:31.2007030Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2007378Z op = silu_mul_quant 2025-05-07T20:33:31.2007644Z if compiled: 2025-05-07T20:33:31.2007915Z op = torch.compile(op) 2025-05-07T20:33:31.2008241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2008534Z 2025-05-07T20:33:31.2008747Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2008922Z 2025-05-07T20:33:31.2009043Z moe/activation_test.py:117: 2025-05-07T20:33:31.2009354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2009720Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2010024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2010750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
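Every example Hypothesis draws fails identically: Triton rejects the fp8e4nv (E4M3) element type while compiling the kernel, before it ever launches. In the Triton build used here, fp8e4nv is only lowered natively on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); older architectures expose only fp8e4b15 and fp8e5, exactly as the ValueError lists. The failures in this log are therefore an environment/hardware mismatch rather than a logic bug in silu_mul_quant. A minimal sketch of a capability gate follows, assuming a plain pytest marker; the requires_fp8e4nv name is hypothetical and is not FBGEMM's actual test setup:

import pytest
import torch

def _supports_fp8e4nv() -> bool:
    # Hypothetical guard: Triton lowers fp8e4nv (E4M3) on NVIDIA SM 8.9+
    # (Ada/Hopper); older GPUs expose only fp8e4b15/fp8e5, per the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8e4nv = pytest.mark.skipif(
    not _supports_fp8e4nv(),
    reason="fp8e4nv (E4M3) is not supported on this GPU architecture",
)

Applied as @requires_fp8e4nv on test_silu_mul_quant, a guard like this would collapse the repeated CompilationError blocks below into a single skip.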
2025-05-07T20:33:31.2011494Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2012068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2012794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2013495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2014067Z kernel = self.compile( 2025-05-07T20:33:31.2014689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2015388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2015800Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2016057Z 2025-05-07T20:33:31.2016273Z self = 2025-05-07T20:33:31.2017411Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2018846Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099b2c0180>} 2025-05-07T20:33:31.2020263Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2021427Z context = 2025-05-07T20:33:31.2021740Z 2025-05-07T20:33:31.2021953Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2022505Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2022997Z module_map=module_map) 2025-05-07T20:33:31.2023429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2023808Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2024085Z E ^ 2025-05-07T20:33:31.2024583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2025114Z 2025-05-07T20:33:31.2025822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2026407Z 2025-05-07T20:33:31.2026526Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2026955Z self=, 2025-05-07T20:33:31.2027384Z T=1, 2025-05-07T20:33:31.2027595Z D=7168, 2025-05-07T20:33:31.2027796Z scale_ub=None, 2025-05-07T20:33:31.2028028Z contiguous=True, 2025-05-07T20:33:31.2028267Z compiled=True, 2025-05-07T20:33:31.2028589Z ) 2025-05-07T20:33:31.2028918Z self = 2025-05-07T20:33:31.2029428Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.2029702Z 2025-05-07T20:33:31.2029790Z @given( 2025-05-07T20:33:31.2030026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2030356Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2030681Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2031018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2031358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2031662Z ) 2025-05-07T20:33:31.2032030Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2032492Z def test_silu_mul_quant( 2025-05-07T20:33:31.2032746Z self, 2025-05-07T20:33:31.2032963Z T: int, 2025-05-07T20:33:31.2033172Z D: int, 2025-05-07T20:33:31.2033414Z scale_ub: Optional[float], 2025-05-07T20:33:31.2033713Z contiguous: bool, 2025-05-07T20:33:31.2033963Z compiled: bool, 2025-05-07T20:33:31.2034211Z ) -> None: 2025-05-07T20:33:31.2034483Z torch.manual_seed(2025) 2025-05-07T20:33:31.2034739Z 2025-05-07T20:33:31.2035035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2035408Z 2025-05-07T20:33:31.2035623Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2035940Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2036273Z x = x_sign * x_clamp 2025-05-07T20:33:31.2036524Z x0 = x[:, :D] 2025-05-07T20:33:31.2036762Z x1 = x[:, D:] 2025-05-07T20:33:31.2036991Z 2025-05-07T20:33:31.2037191Z if contiguous: 2025-05-07T20:33:31.2037448Z x0 = x0.contiguous() 2025-05-07T20:33:31.2037729Z x1 = x1.contiguous() 2025-05-07T20:33:31.2037991Z 2025-05-07T20:33:31.2038196Z if scale_ub is not None: 2025-05-07T20:33:31.2038501Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2038861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2039179Z ) 2025-05-07T20:33:31.2039385Z else: 2025-05-07T20:33:31.2039596Z scale_ub_tensor = None 2025-05-07T20:33:31.2039848Z 2025-05-07T20:33:31.2040089Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2040515Z op = silu_mul_quant 2025-05-07T20:33:31.2040790Z if compiled: 2025-05-07T20:33:31.2041066Z op = torch.compile(op) 2025-05-07T20:33:31.2041389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2041682Z 2025-05-07T20:33:31.2041901Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.2042281Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.2042591Z 2025-05-07T20:33:31.2042865Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2043236Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.2043624Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.2043964Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.2044362Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2044707Z 2025-05-07T20:33:31.2044923Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2045143Z 2025-05-07T20:33:31.2045259Z moe/activation_test.py:126: 2025-05-07T20:33:31.2045594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2045969Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2046331Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2047166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2048033Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2048623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2049363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2050093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2050870Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2051656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2052341Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2052980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2053540Z fn() 2025-05-07T20:33:31.2054077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2054776Z self.fn.run( 2025-05-07T20:33:31.2055294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2055861Z kernel = self.compile( 2025-05-07T20:33:31.2056428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2057116Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2057535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2057777Z 2025-05-07T20:33:31.2057998Z self = 2025-05-07T20:33:31.2059143Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2060576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099b2c0cc0>} 2025-05-07T20:33:31.2061996Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2063144Z context = 2025-05-07T20:33:31.2063449Z 2025-05-07T20:33:31.2063632Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2064214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2064709Z module_map=module_map) 2025-05-07T20:33:31.2065097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2065481Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2065801Z E ^ 2025-05-07T20:33:31.2066293Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2066771Z 2025-05-07T20:33:31.2067222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2067771Z 2025-05-07T20:33:31.2067880Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2068325Z self=, 2025-05-07T20:33:31.2068755Z T=4096, 2025-05-07T20:33:31.2068958Z D=5120, 2025-05-07T20:33:31.2069149Z scale_ub=None, 2025-05-07T20:33:31.2069374Z contiguous=False, 2025-05-07T20:33:31.2069611Z compiled=False, 2025-05-07T20:33:31.2069817Z ) 2025-05-07T20:33:31.2070151Z self = 2025-05-07T20:33:31.2070721Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2071012Z 2025-05-07T20:33:31.2071093Z @given( 2025-05-07T20:33:31.2071333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2071653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2071962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2072308Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2072652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2072953Z ) 2025-05-07T20:33:31.2073309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2073775Z def test_silu_mul_quant( 2025-05-07T20:33:31.2074030Z self, 2025-05-07T20:33:31.2074227Z T: int, 2025-05-07T20:33:31.2074437Z D: int, 2025-05-07T20:33:31.2074662Z scale_ub: Optional[float], 2025-05-07T20:33:31.2074943Z contiguous: bool, 2025-05-07T20:33:31.2075199Z compiled: bool, 2025-05-07T20:33:31.2075426Z ) -> None: 2025-05-07T20:33:31.2075644Z torch.manual_seed(2025) 2025-05-07T20:33:31.2075892Z 2025-05-07T20:33:31.2076180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2076541Z 2025-05-07T20:33:31.2076744Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2077051Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2077365Z x = x_sign * x_clamp 2025-05-07T20:33:31.2077627Z x0 = x[:, :D] 2025-05-07T20:33:31.2077854Z x1 = x[:, D:] 2025-05-07T20:33:31.2078078Z 2025-05-07T20:33:31.2078262Z if contiguous: 2025-05-07T20:33:31.2078505Z x0 = x0.contiguous() 2025-05-07T20:33:31.2078779Z x1 = x1.contiguous() 2025-05-07T20:33:31.2079028Z 2025-05-07T20:33:31.2079234Z if scale_ub is not None: 2025-05-07T20:33:31.2079520Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2079863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2080194Z ) 2025-05-07T20:33:31.2080410Z else: 2025-05-07T20:33:31.2080629Z scale_ub_tensor = None 2025-05-07T20:33:31.2080909Z 2025-05-07T20:33:31.2081151Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2081471Z op = silu_mul_quant 2025-05-07T20:33:31.2081735Z if compiled: 2025-05-07T20:33:31.2081993Z op = torch.compile(op) 2025-05-07T20:33:31.2082351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2082644Z 2025-05-07T20:33:31.2082855Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2083020Z 2025-05-07T20:33:31.2083136Z moe/activation_test.py:117: 2025-05-07T20:33:31.2083486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2083836Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2084140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2084908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2085730Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2086314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2087051Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2087761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2088339Z kernel = self.compile( 2025-05-07T20:33:31.2088931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2089635Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2090065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2090365Z 2025-05-07T20:33:31.2090586Z self = 2025-05-07T20:33:31.2091735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2093182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03b7240>} 2025-05-07T20:33:31.2094708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2095813Z context = 2025-05-07T20:33:31.2096123Z 2025-05-07T20:33:31.2096314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2096874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2097367Z module_map=module_map) 2025-05-07T20:33:31.2097756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2098135Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2098410Z E ^ 2025-05-07T20:33:31.2098904Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2099388Z 2025-05-07T20:33:31.2099838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2100387Z 2025-05-07T20:33:31.2100514Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2100947Z self=, 2025-05-07T20:33:31.2101377Z T=4096, 2025-05-07T20:33:31.2101585Z D=7168, 2025-05-07T20:33:31.2101785Z scale_ub=None, 2025-05-07T20:33:31.2102018Z contiguous=False, 2025-05-07T20:33:31.2102250Z compiled=False, 2025-05-07T20:33:31.2102455Z ) 2025-05-07T20:33:31.2102791Z self = 2025-05-07T20:33:31.2103312Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2103601Z 2025-05-07T20:33:31.2103688Z @given( 2025-05-07T20:33:31.2103968Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2104286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2104643Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2104978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2105365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2105666Z ) 2025-05-07T20:33:31.2106020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2106478Z def test_silu_mul_quant( 2025-05-07T20:33:31.2106761Z self, 2025-05-07T20:33:31.2106950Z T: int, 2025-05-07T20:33:31.2107153Z D: int, 2025-05-07T20:33:31.2107370Z scale_ub: Optional[float], 2025-05-07T20:33:31.2107645Z contiguous: bool, 2025-05-07T20:33:31.2107878Z compiled: bool, 2025-05-07T20:33:31.2108098Z ) -> None: 2025-05-07T20:33:31.2108314Z torch.manual_seed(2025) 2025-05-07T20:33:31.2108551Z 2025-05-07T20:33:31.2108835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2109192Z 2025-05-07T20:33:31.2109376Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2109665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2109984Z x = x_sign * x_clamp 2025-05-07T20:33:31.2110220Z x0 = x[:, :D] 2025-05-07T20:33:31.2110434Z x1 = x[:, D:] 2025-05-07T20:33:31.2110655Z 2025-05-07T20:33:31.2110887Z if contiguous: 2025-05-07T20:33:31.2111128Z x0 = x0.contiguous() 2025-05-07T20:33:31.2111401Z x1 = x1.contiguous() 2025-05-07T20:33:31.2111639Z 2025-05-07T20:33:31.2111842Z if scale_ub is not None: 2025-05-07T20:33:31.2112133Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2112477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2112802Z ) 2025-05-07T20:33:31.2113006Z else: 2025-05-07T20:33:31.2113233Z scale_ub_tensor = None 2025-05-07T20:33:31.2113491Z 2025-05-07T20:33:31.2113735Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2114067Z op = silu_mul_quant 2025-05-07T20:33:31.2114318Z if compiled: 2025-05-07T20:33:31.2114575Z op = torch.compile(op) 2025-05-07T20:33:31.2114880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2115158Z 2025-05-07T20:33:31.2115367Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2115540Z 2025-05-07T20:33:31.2115651Z moe/activation_test.py:117: 2025-05-07T20:33:31.2115960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2116311Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2116600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2117327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2118056Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2118617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2119343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2120042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2120605Z kernel = self.compile( 2025-05-07T20:33:31.2121174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2121880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2122286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2122538Z 2025-05-07T20:33:31.2122749Z self = 2025-05-07T20:33:31.2123938Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2125590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a879ee0>} 2025-05-07T20:33:31.2127093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2128269Z context = 2025-05-07T20:33:31.2128577Z 2025-05-07T20:33:31.2128747Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2129292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2129405Z module_map=module_map) 2025-05-07T20:33:31.2129571Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2129678Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2129756Z E ^ 2025-05-07T20:33:31.2130139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2130144Z 2025-05-07T20:33:31.2130636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2130644Z 2025-05-07T20:33:31.2136835Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2137091Z self=, 2025-05-07T20:33:31.2137171Z T=128, 2025-05-07T20:33:31.2137247Z D=7168, 2025-05-07T20:33:31.2137330Z scale_ub=None, 2025-05-07T20:33:31.2137417Z contiguous=False, 2025-05-07T20:33:31.2137506Z compiled=True, 2025-05-07T20:33:31.2137582Z ) 2025-05-07T20:33:31.2137811Z self = 2025-05-07T20:33:31.2137985Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2137990Z 2025-05-07T20:33:31.2138072Z @given( 2025-05-07T20:33:31.2138194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2138293Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2138412Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2138531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2138644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2138719Z ) 2025-05-07T20:33:31.2138971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2139067Z def test_silu_mul_quant( 2025-05-07T20:33:31.2139145Z self, 2025-05-07T20:33:31.2139227Z T: int, 2025-05-07T20:33:31.2139307Z D: int, 2025-05-07T20:33:31.2139403Z scale_ub: Optional[float], 2025-05-07T20:33:31.2139490Z contiguous: bool, 2025-05-07T20:33:31.2139576Z compiled: bool, 2025-05-07T20:33:31.2139653Z ) -> None: 2025-05-07T20:33:31.2139747Z torch.manual_seed(2025) 2025-05-07T20:33:31.2139825Z 2025-05-07T20:33:31.2139995Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2140068Z 2025-05-07T20:33:31.2140166Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2140292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2140386Z x = x_sign * x_clamp 2025-05-07T20:33:31.2140474Z x0 = x[:, :D] 2025-05-07T20:33:31.2140556Z x1 = x[:, D:] 2025-05-07T20:33:31.2140627Z 2025-05-07T20:33:31.2140715Z if contiguous: 2025-05-07T20:33:31.2140804Z x0 = x0.contiguous() 2025-05-07T20:33:31.2140894Z x1 = x1.contiguous() 2025-05-07T20:33:31.2141070Z 2025-05-07T20:33:31.2141160Z if scale_ub is not None: 2025-05-07T20:33:31.2141272Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2141406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2141481Z ) 2025-05-07T20:33:31.2141558Z else: 2025-05-07T20:33:31.2141713Z scale_ub_tensor = None 2025-05-07T20:33:31.2141785Z 2025-05-07T20:33:31.2141919Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2142012Z op = silu_mul_quant 2025-05-07T20:33:31.2142137Z if compiled: 2025-05-07T20:33:31.2142244Z op = torch.compile(op) 2025-05-07T20:33:31.2142349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2142426Z 2025-05-07T20:33:31.2142518Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.2142640Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.2142716Z 2025-05-07T20:33:31.2142849Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2142952Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.2143058Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.2143176Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.2143317Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2143395Z 2025-05-07T20:33:31.2143493Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2143498Z 2025-05-07T20:33:31.2143733Z moe/activation_test.py:126: 2025-05-07T20:33:31.2143869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2143979Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2144114Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2144754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2144858Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2145248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2145475Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2145871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2146134Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2146532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2146708Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2147065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2147141Z fn() 2025-05-07T20:33:31.2147566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2147651Z self.fn.run( 2025-05-07T20:33:31.2148011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2148109Z kernel = self.compile( 2025-05-07T20:33:31.2148510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2148693Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2148825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2148830Z 2025-05-07T20:33:31.2149040Z self = 2025-05-07T20:33:31.2149855Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2150422Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a7f9120>} 2025-05-07T20:33:31.2151280Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2151478Z context = 2025-05-07T20:33:31.2151520Z 2025-05-07T20:33:31.2151695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2151972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2152084Z module_map=module_map) 2025-05-07T20:33:31.2152259Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2152368Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2152456Z E ^ 2025-05-07T20:33:31.2152827Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2152832Z 2025-05-07T20:33:31.2153270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2153275Z 2025-05-07T20:33:31.2153433Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2153662Z self=, 2025-05-07T20:33:31.2153755Z T=128, 2025-05-07T20:33:31.2153837Z D=7168, 2025-05-07T20:33:31.2153923Z scale_ub=None, 2025-05-07T20:33:31.2154021Z contiguous=False, 2025-05-07T20:33:31.2154108Z compiled=False, 2025-05-07T20:33:31.2154186Z ) 2025-05-07T20:33:31.2154416Z self = 2025-05-07T20:33:31.2154598Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2154603Z 2025-05-07T20:33:31.2154684Z @given( 2025-05-07T20:33:31.2154810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2154912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2155036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2155160Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2155279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2155372Z ) 2025-05-07T20:33:31.2155625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2155721Z def test_silu_mul_quant( 2025-05-07T20:33:31.2155807Z self, 2025-05-07T20:33:31.2155888Z T: int, 2025-05-07T20:33:31.2155967Z D: int, 2025-05-07T20:33:31.2156075Z scale_ub: Optional[float], 2025-05-07T20:33:31.2156172Z contiguous: bool, 2025-05-07T20:33:31.2156269Z compiled: bool, 2025-05-07T20:33:31.2156354Z ) -> None: 2025-05-07T20:33:31.2156452Z torch.manual_seed(2025) 2025-05-07T20:33:31.2156526Z 2025-05-07T20:33:31.2156704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2156781Z 2025-05-07T20:33:31.2156885Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2157011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2157102Z x = x_sign * x_clamp 2025-05-07T20:33:31.2157187Z x0 = x[:, :D] 2025-05-07T20:33:31.2157271Z x1 = x[:, D:] 2025-05-07T20:33:31.2157346Z 2025-05-07T20:33:31.2157441Z if contiguous: 2025-05-07T20:33:31.2157535Z x0 = x0.contiguous() 2025-05-07T20:33:31.2157629Z x1 = x1.contiguous() 2025-05-07T20:33:31.2157712Z 2025-05-07T20:33:31.2157804Z if scale_ub is not None: 2025-05-07T20:33:31.2157911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2158099Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2158182Z ) 2025-05-07T20:33:31.2158264Z else: 2025-05-07T20:33:31.2158360Z scale_ub_tensor = None 2025-05-07T20:33:31.2158436Z 2025-05-07T20:33:31.2158617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2158711Z op = silu_mul_quant 2025-05-07T20:33:31.2158798Z if compiled: 2025-05-07T20:33:31.2158912Z op = torch.compile(op) 2025-05-07T20:33:31.2159021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2159139Z 2025-05-07T20:33:31.2159239Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2159243Z 2025-05-07T20:33:31.2159342Z moe/activation_test.py:117: 2025-05-07T20:33:31.2159480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2159589Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2159692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2160223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2160323Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2160704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2160936Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2161342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2161453Z kernel = self.compile( 2025-05-07T20:33:31.2161864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2162043Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2162184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2162191Z 2025-05-07T20:33:31.2162408Z self = 2025-05-07T20:33:31.2163234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2163760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4b8b80>} 2025-05-07T20:33:31.2164562Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2164770Z context = 2025-05-07T20:33:31.2164774Z 2025-05-07T20:33:31.2164979Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2165282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2165392Z module_map=module_map) 2025-05-07T20:33:31.2165561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2165671Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2165752Z E ^ 2025-05-07T20:33:31.2166130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2166141Z 2025-05-07T20:33:31.2166582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2166586Z 2025-05-07T20:33:31.2166695Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2166933Z self=, 2025-05-07T20:33:31.2167064Z T=4096, 2025-05-07T20:33:31.2167143Z D=5120, 2025-05-07T20:33:31.2167234Z scale_ub=1200.0, 2025-05-07T20:33:31.2167324Z contiguous=True, 2025-05-07T20:33:31.2167413Z compiled=False, 2025-05-07T20:33:31.2167497Z ) 2025-05-07T20:33:31.2167766Z self = 2025-05-07T20:33:31.2167955Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2167960Z 2025-05-07T20:33:31.2168047Z @given( 2025-05-07T20:33:31.2168168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2168313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2168436Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2168557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2168681Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2168762Z ) 2025-05-07T20:33:31.2169014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2169121Z def test_silu_mul_quant( 2025-05-07T20:33:31.2169202Z self, 2025-05-07T20:33:31.2169288Z T: int, 2025-05-07T20:33:31.2169370Z D: int, 2025-05-07T20:33:31.2169478Z scale_ub: Optional[float], 2025-05-07T20:33:31.2169580Z contiguous: bool, 2025-05-07T20:33:31.2169670Z compiled: bool, 2025-05-07T20:33:31.2169751Z ) -> None: 2025-05-07T20:33:31.2169855Z torch.manual_seed(2025) 2025-05-07T20:33:31.2169980Z 2025-05-07T20:33:31.2170159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2170247Z 2025-05-07T20:33:31.2170340Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2170466Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2170560Z x = x_sign * x_clamp 2025-05-07T20:33:31.2170644Z x0 = x[:, :D] 2025-05-07T20:33:31.2170729Z x1 = x[:, D:] 2025-05-07T20:33:31.2170808Z 2025-05-07T20:33:31.2170894Z if contiguous: 2025-05-07T20:33:31.2170994Z x0 = x0.contiguous() 2025-05-07T20:33:31.2171087Z x1 = x1.contiguous() 2025-05-07T20:33:31.2171158Z 2025-05-07T20:33:31.2171260Z if scale_ub is not None: 2025-05-07T20:33:31.2171373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2171505Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2171593Z ) 2025-05-07T20:33:31.2171676Z else: 2025-05-07T20:33:31.2171774Z scale_ub_tensor = None 2025-05-07T20:33:31.2171856Z 2025-05-07T20:33:31.2171986Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2172081Z op = silu_mul_quant 2025-05-07T20:33:31.2172171Z if compiled: 2025-05-07T20:33:31.2172273Z op = torch.compile(op) 2025-05-07T20:33:31.2172390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2172468Z 2025-05-07T20:33:31.2172563Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2172568Z 2025-05-07T20:33:31.2172671Z moe/activation_test.py:117: 2025-05-07T20:33:31.2172803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2172907Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2173018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2173547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2173659Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2174040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2174292Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2174784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2174879Z kernel = self.compile( 2025-05-07T20:33:31.2175333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2175513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2175685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2175690Z 2025-05-07T20:33:31.2175902Z self = 2025-05-07T20:33:31.2176716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2177275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4b9b20>} 2025-05-07T20:33:31.2178073Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2178271Z context = 2025-05-07T20:33:31.2178276Z 2025-05-07T20:33:31.2178449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2178761Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2178869Z module_map=module_map) 2025-05-07T20:33:31.2179036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2179138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2179218Z E ^ 2025-05-07T20:33:31.2179587Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2179592Z 2025-05-07T20:33:31.2180032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2180037Z 2025-05-07T20:33:31.2180147Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2180379Z self=, 2025-05-07T20:33:31.2180462Z T=1, 2025-05-07T20:33:31.2180540Z D=5120, 2025-05-07T20:33:31.2180625Z scale_ub=None, 2025-05-07T20:33:31.2180720Z contiguous=True, 2025-05-07T20:33:31.2180809Z compiled=True, 2025-05-07T20:33:31.2180887Z ) 2025-05-07T20:33:31.2181119Z self = 2025-05-07T20:33:31.2181284Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.2181289Z 2025-05-07T20:33:31.2181369Z @given( 2025-05-07T20:33:31.2181497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2181600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2181730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2181847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2181959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2182042Z ) 2025-05-07T20:33:31.2182299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2182396Z def test_silu_mul_quant( 2025-05-07T20:33:31.2182486Z self, 2025-05-07T20:33:31.2182566Z T: int, 2025-05-07T20:33:31.2182646Z D: int, 2025-05-07T20:33:31.2182755Z scale_ub: Optional[float], 2025-05-07T20:33:31.2182852Z contiguous: bool, 2025-05-07T20:33:31.2182943Z compiled: bool, 2025-05-07T20:33:31.2183029Z ) -> None: 2025-05-07T20:33:31.2183121Z torch.manual_seed(2025) 2025-05-07T20:33:31.2183197Z 2025-05-07T20:33:31.2183367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2183438Z 2025-05-07T20:33:31.2183606Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2183728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2183817Z x = x_sign * x_clamp 2025-05-07T20:33:31.2183897Z x0 = x[:, :D] 2025-05-07T20:33:31.2183972Z x1 = x[:, D:] 2025-05-07T20:33:31.2184044Z 2025-05-07T20:33:31.2184175Z if contiguous: 2025-05-07T20:33:31.2184270Z x0 = x0.contiguous() 2025-05-07T20:33:31.2184355Z x1 = x1.contiguous() 2025-05-07T20:33:31.2184431Z 2025-05-07T20:33:31.2184522Z if scale_ub is not None: 2025-05-07T20:33:31.2184665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2184805Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2184885Z ) 2025-05-07T20:33:31.2184970Z else: 2025-05-07T20:33:31.2185064Z scale_ub_tensor = None 2025-05-07T20:33:31.2185137Z 2025-05-07T20:33:31.2185269Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2185357Z op = silu_mul_quant 2025-05-07T20:33:31.2185438Z if compiled: 2025-05-07T20:33:31.2185535Z op = torch.compile(op) 2025-05-07T20:33:31.2185643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2185713Z 2025-05-07T20:33:31.2185815Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.2185931Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.2186003Z 2025-05-07T20:33:31.2186182Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2186288Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.2186396Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.2186518Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.2186656Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2186739Z 2025-05-07T20:33:31.2186837Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2186844Z 2025-05-07T20:33:31.2186937Z moe/activation_test.py:126: 2025-05-07T20:33:31.2187074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2187175Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2187313Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2187905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2188011Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2188405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2188632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2189015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2189283Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2189678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2189849Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2190207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2190278Z fn() 2025-05-07T20:33:31.2190707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2190787Z self.fn.run( 2025-05-07T20:33:31.2191149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2191240Z kernel = self.compile( 2025-05-07T20:33:31.2191639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2191877Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2192008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2192013Z 2025-05-07T20:33:31.2192225Z self = 2025-05-07T20:33:31.2193088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2193640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4baca0>} 2025-05-07T20:33:31.2194439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2194635Z context = 2025-05-07T20:33:31.2194639Z 2025-05-07T20:33:31.2194817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2195096Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2195205Z module_map=module_map) 2025-05-07T20:33:31.2195411Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2195520Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2195603Z E ^ 2025-05-07T20:33:31.2195983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2195987Z 2025-05-07T20:33:31.2196425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2196433Z 2025-05-07T20:33:31.2196542Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2196771Z self=, 2025-05-07T20:33:31.2196850Z T=2048, 2025-05-07T20:33:31.2196934Z D=5120, 2025-05-07T20:33:31.2197019Z scale_ub=None, 2025-05-07T20:33:31.2197111Z contiguous=True, 2025-05-07T20:33:31.2197201Z compiled=True, 2025-05-07T20:33:31.2197276Z ) 2025-05-07T20:33:31.2197509Z self = 2025-05-07T20:33:31.2197688Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.2197697Z 2025-05-07T20:33:31.2197775Z @given( 2025-05-07T20:33:31.2197897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2198007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2198123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2198248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2198369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2198449Z ) 2025-05-07T20:33:31.2198702Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2198805Z def test_silu_mul_quant( 2025-05-07T20:33:31.2198888Z self, 2025-05-07T20:33:31.2198970Z T: int, 2025-05-07T20:33:31.2199050Z D: int, 2025-05-07T20:33:31.2199151Z scale_ub: Optional[float], 2025-05-07T20:33:31.2199245Z contiguous: bool, 2025-05-07T20:33:31.2199335Z compiled: bool, 2025-05-07T20:33:31.2199423Z ) -> None: 2025-05-07T20:33:31.2199516Z torch.manual_seed(2025) 2025-05-07T20:33:31.2199592Z 2025-05-07T20:33:31.2199760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2199830Z 2025-05-07T20:33:31.2199923Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2200043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2200178Z x = x_sign * x_clamp 2025-05-07T20:33:31.2200255Z x0 = x[:, :D] 2025-05-07T20:33:31.2200329Z x1 = x[:, D:] 2025-05-07T20:33:31.2200405Z 2025-05-07T20:33:31.2200491Z if contiguous: 2025-05-07T20:33:31.2200578Z x0 = x0.contiguous() 2025-05-07T20:33:31.2200704Z x1 = x1.contiguous() 2025-05-07T20:33:31.2200784Z 2025-05-07T20:33:31.2200873Z if scale_ub is not None: 2025-05-07T20:33:31.2200976Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2201115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2201234Z ) 2025-05-07T20:33:31.2201312Z else: 2025-05-07T20:33:31.2201405Z scale_ub_tensor = None 2025-05-07T20:33:31.2201478Z 2025-05-07T20:33:31.2201608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2201699Z op = silu_mul_quant 2025-05-07T20:33:31.2201790Z if compiled: 2025-05-07T20:33:31.2201890Z op = torch.compile(op) 2025-05-07T20:33:31.2201994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2202071Z 2025-05-07T20:33:31.2202162Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.2202281Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.2202354Z 2025-05-07T20:33:31.2202502Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2202606Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.2202760Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.2202884Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.2203031Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2203110Z 2025-05-07T20:33:31.2203207Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2203211Z 2025-05-07T20:33:31.2203306Z moe/activation_test.py:126: 2025-05-07T20:33:31.2203444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2203553Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2203697Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2204304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2204418Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2204828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2205053Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2205445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2205711Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2206104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2206279Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2206642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2206718Z fn() 2025-05-07T20:33:31.2207150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2207234Z self.fn.run( 2025-05-07T20:33:31.2207598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2207688Z kernel = self.compile( 2025-05-07T20:33:31.2208087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2208271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2208400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2208450Z 2025-05-07T20:33:31.2208657Z self = 2025-05-07T20:33:31.2209517Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2210035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a565e40>} 2025-05-07T20:33:31.2210876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2211069Z context = 2025-05-07T20:33:31.2211076Z 2025-05-07T20:33:31.2211261Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2211538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2211646Z module_map=module_map) 2025-05-07T20:33:31.2211824Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2211926Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2212002Z E ^ 2025-05-07T20:33:31.2212453Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: CompilationError (traceback identical to the one above: ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row fails to build for fp8e4nv)

Hypothesis then retried the test with further examples; each retry prints the same test body and an identical traceback, so the remaining examples are consolidated below by failure site. These examples likewise fail inside ref_fn() at moe/activation_test.py:126 (_kernel_quantize_fp8_row):

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None, contiguous=False, compiled=True)
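For reference, the contract the failing ref_fn path relies on: triton_quantize_fp8_row quantizes each row of y to FP8 with one float32 scale per row, optionally clamping the per-row maximum by scale_ub, so that callers can dequantize as the test does with y_fp8.to(torch.float32) * y_scale[:, None]. A minimal eager sketch of those semantics follows; the function name, the eps guard, and the 448.0 E4M3 maximum are assumptions, and this is an illustration of the contract, not FBGEMM's kernel:

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # assumed max finite magnitude of torch.float8_e4m3fn


def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # One scale per row, derived from the row's max absolute value.
    row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    scale = FP8_E4M3_MAX / row_max
    y_fp8 = (y.to(torch.float32) * scale).to(torch.float8_e4m3fn)
    # Return the inverse scale so dequantization matches the test:
    # y ~= y_fp8.to(torch.float32) * y_scale[:, None]
    return y_fp8, (1.0 / scale).squeeze(-1)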
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2287284Z 2025-05-07T20:33:31.2287381Z moe/activation_test.py:126: 2025-05-07T20:33:31.2287514Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2287625Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2287808Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2288399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2288506Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2288925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2289169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2289604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2289872Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2290277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2290454Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2290823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2290897Z fn() 2025-05-07T20:33:31.2291323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2291415Z self.fn.run( 2025-05-07T20:33:31.2291811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2291914Z kernel = self.compile( 2025-05-07T20:33:31.2292325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2292511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2292651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2292656Z 2025-05-07T20:33:31.2292872Z self = 2025-05-07T20:33:31.2293698Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2294227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e500eb60>} 2025-05-07T20:33:31.2295129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2295360Z context = 2025-05-07T20:33:31.2295365Z 2025-05-07T20:33:31.2295535Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2295814Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2295926Z module_map=module_map) 2025-05-07T20:33:31.2296088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2296201Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2296280Z E ^ 2025-05-07T20:33:31.2296655Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2296660Z 2025-05-07T20:33:31.2297105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2297110Z 2025-05-07T20:33:31.2297217Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2297449Z self=, 2025-05-07T20:33:31.2297525Z T=1, 2025-05-07T20:33:31.2297603Z D=5120, 2025-05-07T20:33:31.2297735Z scale_ub=None, 2025-05-07T20:33:31.2297819Z contiguous=True, 2025-05-07T20:33:31.2297904Z compiled=False, 2025-05-07T20:33:31.2297989Z ) 2025-05-07T20:33:31.2298214Z self = 2025-05-07T20:33:31.2298419Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2298424Z 2025-05-07T20:33:31.2298507Z @given( 2025-05-07T20:33:31.2298632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2298737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2298895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2299014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2299134Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2299212Z ) 2025-05-07T20:33:31.2299471Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2299572Z def test_silu_mul_quant( 2025-05-07T20:33:31.2299650Z self, 2025-05-07T20:33:31.2299729Z T: int, 2025-05-07T20:33:31.2299808Z D: int, 2025-05-07T20:33:31.2299908Z scale_ub: Optional[float], 2025-05-07T20:33:31.2300003Z contiguous: bool, 2025-05-07T20:33:31.2300093Z compiled: bool, 2025-05-07T20:33:31.2300179Z ) -> None: 2025-05-07T20:33:31.2300277Z torch.manual_seed(2025) 2025-05-07T20:33:31.2300349Z 2025-05-07T20:33:31.2300566Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2300651Z 2025-05-07T20:33:31.2300743Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2300868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2300961Z x = x_sign * x_clamp 2025-05-07T20:33:31.2301040Z x0 = x[:, :D] 2025-05-07T20:33:31.2301125Z x1 = x[:, D:] 2025-05-07T20:33:31.2301207Z 2025-05-07T20:33:31.2301290Z if contiguous: 2025-05-07T20:33:31.2301386Z x0 = x0.contiguous() 2025-05-07T20:33:31.2301486Z x1 = x1.contiguous() 2025-05-07T20:33:31.2301560Z 2025-05-07T20:33:31.2301656Z if scale_ub is not None: 2025-05-07T20:33:31.2301760Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2301903Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2301981Z ) 2025-05-07T20:33:31.2302059Z else: 2025-05-07T20:33:31.2302154Z scale_ub_tensor = None 2025-05-07T20:33:31.2302234Z 2025-05-07T20:33:31.2302365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2302464Z op = silu_mul_quant 2025-05-07T20:33:31.2302550Z if compiled: 2025-05-07T20:33:31.2302650Z op = torch.compile(op) 2025-05-07T20:33:31.2302756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2302836Z 2025-05-07T20:33:31.2302932Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2302936Z 2025-05-07T20:33:31.2303041Z moe/activation_test.py:117: 2025-05-07T20:33:31.2303180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2303281Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2303384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2303914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2304012Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2304407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2304684Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2305046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2305140Z kernel = self.compile( 2025-05-07T20:33:31.2305545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2305779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2305910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2305915Z 2025-05-07T20:33:31.2306169Z self = 2025-05-07T20:33:31.2306989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2307545Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e500f9c0>} 2025-05-07T20:33:31.2308345Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2308548Z context = 2025-05-07T20:33:31.2308553Z 2025-05-07T20:33:31.2308728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2309000Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2309147Z module_map=module_map) 2025-05-07T20:33:31.2309314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2309420Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2309496Z E ^ 2025-05-07T20:33:31.2309874Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2309879Z 2025-05-07T20:33:31.2310315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2310322Z 2025-05-07T20:33:31.2310431Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2310662Z self=, 2025-05-07T20:33:31.2310741Z T=128, 2025-05-07T20:33:31.2310822Z D=5120, 2025-05-07T20:33:31.2310907Z scale_ub=None, 2025-05-07T20:33:31.2310998Z contiguous=False, 2025-05-07T20:33:31.2311083Z compiled=True, 2025-05-07T20:33:31.2311160Z ) 2025-05-07T20:33:31.2311391Z self = 2025-05-07T20:33:31.2311569Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2311573Z 2025-05-07T20:33:31.2311657Z @given( 2025-05-07T20:33:31.2311780Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2311881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2311997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2312127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2312244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2312328Z ) 2025-05-07T20:33:31.2312582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2312680Z def test_silu_mul_quant( 2025-05-07T20:33:31.2312758Z self, 2025-05-07T20:33:31.2312841Z T: int, 2025-05-07T20:33:31.2312919Z D: int, 2025-05-07T20:33:31.2313026Z scale_ub: Optional[float], 2025-05-07T20:33:31.2313119Z contiguous: bool, 2025-05-07T20:33:31.2313210Z compiled: bool, 2025-05-07T20:33:31.2313300Z ) -> None: 2025-05-07T20:33:31.2313396Z torch.manual_seed(2025) 2025-05-07T20:33:31.2313473Z 2025-05-07T20:33:31.2313654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2313728Z 2025-05-07T20:33:31.2313826Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2314025Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2314114Z x = x_sign * x_clamp 2025-05-07T20:33:31.2314201Z x0 = x[:, :D] 2025-05-07T20:33:31.2314278Z x1 = x[:, D:] 2025-05-07T20:33:31.2314354Z 2025-05-07T20:33:31.2314441Z if contiguous: 2025-05-07T20:33:31.2314573Z x0 = x0.contiguous() 2025-05-07T20:33:31.2314671Z x1 = x1.contiguous() 2025-05-07T20:33:31.2314768Z 2025-05-07T20:33:31.2314872Z if scale_ub is not None: 2025-05-07T20:33:31.2315005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2315185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2315259Z ) 2025-05-07T20:33:31.2315341Z else: 2025-05-07T20:33:31.2315441Z scale_ub_tensor = None 2025-05-07T20:33:31.2315516Z 2025-05-07T20:33:31.2315653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2315743Z op = silu_mul_quant 2025-05-07T20:33:31.2315832Z if compiled: 2025-05-07T20:33:31.2315933Z op = torch.compile(op) 2025-05-07T20:33:31.2316039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2316111Z 2025-05-07T20:33:31.2316208Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2316213Z 2025-05-07T20:33:31.2316315Z moe/activation_test.py:117: 2025-05-07T20:33:31.2316448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2316594Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2316695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2317089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2317188Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2317716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2317824Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2318201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2318431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2318803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2318900Z kernel = self.compile( 2025-05-07T20:33:31.2319312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2319498Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2319630Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2319634Z 2025-05-07T20:33:31.2319850Z self = 2025-05-07T20:33:31.2320663Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2321187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e500ca40>} 2025-05-07T20:33:31.2321986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2322185Z context = 2025-05-07T20:33:31.2322190Z 2025-05-07T20:33:31.2322365Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2322638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2322788Z module_map=module_map) 2025-05-07T20:33:31.2322950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2323051Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2323134Z E ^ 2025-05-07T20:33:31.2323544Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2323549Z 2025-05-07T20:33:31.2324000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2324044Z 2025-05-07T20:33:31.2324151Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2324384Z self=, 2025-05-07T20:33:31.2324469Z T=128, 2025-05-07T20:33:31.2324548Z D=7168, 2025-05-07T20:33:31.2324630Z scale_ub=1200.0, 2025-05-07T20:33:31.2324714Z contiguous=False, 2025-05-07T20:33:31.2324813Z compiled=False, 2025-05-07T20:33:31.2324890Z ) 2025-05-07T20:33:31.2325144Z self = 2025-05-07T20:33:31.2325319Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2325324Z 2025-05-07T20:33:31.2325589Z @given( 2025-05-07T20:33:31.2325769Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2325906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2326118Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2326246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2326373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2326449Z ) 2025-05-07T20:33:31.2326737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2326838Z def test_silu_mul_quant( 2025-05-07T20:33:31.2326918Z self, 2025-05-07T20:33:31.2326995Z T: int, 2025-05-07T20:33:31.2327077Z D: int, 2025-05-07T20:33:31.2327173Z scale_ub: Optional[float], 2025-05-07T20:33:31.2327259Z contiguous: bool, 2025-05-07T20:33:31.2327344Z compiled: bool, 2025-05-07T20:33:31.2327421Z ) -> None: 2025-05-07T20:33:31.2327513Z torch.manual_seed(2025) 2025-05-07T20:33:31.2327586Z 2025-05-07T20:33:31.2327762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2327834Z 2025-05-07T20:33:31.2327927Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2328053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2328146Z x = x_sign * x_clamp 2025-05-07T20:33:31.2328222Z x0 = x[:, :D] 2025-05-07T20:33:31.2328301Z x1 = x[:, D:] 2025-05-07T20:33:31.2328374Z 2025-05-07T20:33:31.2328455Z if contiguous: 2025-05-07T20:33:31.2328542Z x0 = x0.contiguous() 2025-05-07T20:33:31.2328630Z x1 = x1.contiguous() 2025-05-07T20:33:31.2328700Z 2025-05-07T20:33:31.2328793Z if scale_ub is not None: 2025-05-07T20:33:31.2328896Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2329026Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2329103Z ) 2025-05-07T20:33:31.2329174Z else: 2025-05-07T20:33:31.2329267Z scale_ub_tensor = None 2025-05-07T20:33:31.2329343Z 2025-05-07T20:33:31.2329470Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2329559Z op = silu_mul_quant 2025-05-07T20:33:31.2329641Z if compiled: 2025-05-07T20:33:31.2329739Z op = torch.compile(op) 2025-05-07T20:33:31.2329842Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2329913Z 2025-05-07T20:33:31.2330000Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2330004Z 2025-05-07T20:33:31.2330098Z moe/activation_test.py:117: 2025-05-07T20:33:31.2330231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2330397Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2330497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2331020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2331174Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2331555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2331784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2332204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2332299Z kernel = self.compile( 2025-05-07T20:33:31.2332703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2332883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2333015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2333020Z 2025-05-07T20:33:31.2333232Z self = 2025-05-07T20:33:31.2334091Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2334674Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e5d34540>} 2025-05-07T20:33:31.2335471Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2335673Z context = 2025-05-07T20:33:31.2335677Z 2025-05-07T20:33:31.2335854Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2336130Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2336239Z module_map=module_map) 2025-05-07T20:33:31.2336406Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2336510Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2336591Z E ^ 2025-05-07T20:33:31.2336965Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2336969Z 2025-05-07T20:33:31.2337406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2337410Z 2025-05-07T20:33:31.2337519Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2337753Z self=, 2025-05-07T20:33:31.2337835Z T=128, 2025-05-07T20:33:31.2337916Z D=5120, 2025-05-07T20:33:31.2338000Z scale_ub=None, 2025-05-07T20:33:31.2338087Z contiguous=False, 2025-05-07T20:33:31.2338175Z compiled=False, 2025-05-07T20:33:31.2338250Z ) 2025-05-07T20:33:31.2338482Z self = 2025-05-07T20:33:31.2338660Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2338666Z 2025-05-07T20:33:31.2338747Z @given( 2025-05-07T20:33:31.2338867Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2338965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2339078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2339195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2339306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2339425Z ) 2025-05-07T20:33:31.2339680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2339772Z def test_silu_mul_quant( 2025-05-07T20:33:31.2339850Z self, 2025-05-07T20:33:31.2339924Z T: int, 2025-05-07T20:33:31.2340038Z D: int, 2025-05-07T20:33:31.2340137Z scale_ub: Optional[float], 2025-05-07T20:33:31.2340222Z contiguous: bool, 2025-05-07T20:33:31.2340305Z compiled: bool, 2025-05-07T20:33:31.2340386Z ) -> None: 2025-05-07T20:33:31.2340517Z torch.manual_seed(2025) 2025-05-07T20:33:31.2340588Z 2025-05-07T20:33:31.2340762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2340834Z 2025-05-07T20:33:31.2340922Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2341046Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2341129Z x = x_sign * x_clamp 2025-05-07T20:33:31.2341215Z x0 = x[:, :D] 2025-05-07T20:33:31.2341290Z x1 = x[:, D:] 2025-05-07T20:33:31.2341357Z 2025-05-07T20:33:31.2341441Z if contiguous: 2025-05-07T20:33:31.2341529Z x0 = x0.contiguous() 2025-05-07T20:33:31.2341614Z x1 = x1.contiguous() 2025-05-07T20:33:31.2341689Z 2025-05-07T20:33:31.2341777Z if scale_ub is not None: 2025-05-07T20:33:31.2341877Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2342077Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2342154Z ) 2025-05-07T20:33:31.2342228Z else: 2025-05-07T20:33:31.2342326Z scale_ub_tensor = None 2025-05-07T20:33:31.2342394Z 2025-05-07T20:33:31.2342523Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2342614Z op = silu_mul_quant 2025-05-07T20:33:31.2342694Z if compiled: 2025-05-07T20:33:31.2342794Z op = torch.compile(op) 2025-05-07T20:33:31.2342898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2342968Z 2025-05-07T20:33:31.2343058Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2343062Z 2025-05-07T20:33:31.2343155Z moe/activation_test.py:117: 2025-05-07T20:33:31.2343289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2343389Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2343487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2344015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2344114Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2344487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2344718Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2345074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2345166Z kernel = self.compile( 2025-05-07T20:33:31.2345573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2345749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2345879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2345889Z 2025-05-07T20:33:31.2346093Z self = 2025-05-07T20:33:31.2346904Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2347419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e50c0400>} 2025-05-07T20:33:31.2348255Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2348489Z context = 2025-05-07T20:33:31.2348494Z 2025-05-07T20:33:31.2348667Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2348939Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2349089Z module_map=module_map) 2025-05-07T20:33:31.2349252Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2349358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2349434Z E ^ 2025-05-07T20:33:31.2349805Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:31.2349812Z 
2025-05-07T20:33:31.2350250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:31.2350254Z 
2025-05-07T20:33:31.2350360Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:31.2350593Z     self=,
2025-05-07T20:33:31.2350670Z     T=128,
2025-05-07T20:33:31.2350792Z     D=5120,
2025-05-07T20:33:31.2350882Z     scale_ub=1200.0,
2025-05-07T20:33:31.2350970Z     contiguous=True,
2025-05-07T20:33:31.2351056Z     compiled=False,
2025-05-07T20:33:31.2351132Z )
2025-05-07T20:33:31.2351355Z self = 
2025-05-07T20:33:31.2351528Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:33:31.2351532Z 
2025-05-07T20:33:31.2351618Z     @given(
2025-05-07T20:33:31.2351738Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:31.2351841Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:31.2351953Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:31.2352071Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:31.2352190Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:31.2352263Z     )
2025-05-07T20:33:31.2352514Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:31.2352614Z     def test_silu_mul_quant(
2025-05-07T20:33:31.2352696Z         self,
2025-05-07T20:33:31.2352773Z         T: int,
2025-05-07T20:33:31.2352855Z         D: int,
2025-05-07T20:33:31.2352955Z         scale_ub: Optional[float],
2025-05-07T20:33:31.2353044Z         contiguous: bool,
2025-05-07T20:33:31.2353131Z         compiled: bool,
2025-05-07T20:33:31.2353208Z     ) -> None:
2025-05-07T20:33:31.2353302Z         torch.manual_seed(2025)
2025-05-07T20:33:31.2353373Z 
2025-05-07T20:33:31.2353540Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:31.2353618Z 
2025-05-07T20:33:31.2353704Z         x_sign = torch.sign(x)
2025-05-07T20:33:31.2353824Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:31.2353916Z         x = x_sign * x_clamp
2025-05-07T20:33:31.2353994Z         x0 = x[:, :D]
2025-05-07T20:33:31.2354067Z         x1 = x[:, D:]
2025-05-07T20:33:31.2354143Z 
2025-05-07T20:33:31.2354226Z         if contiguous:
2025-05-07T20:33:31.2354315Z             x0 = x0.contiguous()
2025-05-07T20:33:31.2354405Z             x1 = x1.contiguous()
2025-05-07T20:33:31.2354475Z 
2025-05-07T20:33:31.2354564Z         if scale_ub is not None:
2025-05-07T20:33:31.2354666Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:31.2354812Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:31.2354899Z             )
2025-05-07T20:33:31.2354986Z         else:
2025-05-07T20:33:31.2355135Z             scale_ub_tensor = None
2025-05-07T20:33:31.2355208Z 
2025-05-07T20:33:31.2355334Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:31.2355420Z             op = silu_mul_quant
2025-05-07T20:33:31.2355503Z             if compiled:
2025-05-07T20:33:31.2355642Z                 op = torch.compile(op)
2025-05-07T20:33:31.2355746Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:31.2355819Z 
2025-05-07T20:33:31.2355907Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:31.2355913Z 
2025-05-07T20:33:31.2356010Z moe/activation_test.py:117: 
2025-05-07T20:33:31.2356183Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:31.2356280Z moe/activation_test.py:115: in fn
2025-05-07T20:33:31.2356381Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:31.2356906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:31.2357003Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:31.2357380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:31.2357606Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:31.2357967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:31.2358058Z     kernel = self.compile(
2025-05-07T20:33:31.2358498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:31.2358681Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:31.2358810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:31.2358814Z 
2025-05-07T20:33:31.2359023Z self = 
2025-05-07T20:33:31.2359832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:31.2360346Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e50c1300>}
2025-05-07T20:33:31.2361143Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:31.2361337Z context = 
2025-05-07T20:33:31.2361341Z 
2025-05-07T20:33:31.2361511Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:31.2361779Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:31.2361882Z                            module_map=module_map)
2025-05-07T20:33:31.2362044Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:31.2362139Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:31.2362214Z E       ^
2025-05-07T20:33:31.2362584Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:31.2362589Z 
2025-05-07T20:33:31.2363026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:31.2363033Z 
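[editor's note] The ValueError above comes from Triton's NVIDIA backend: fp8e4nv (FP8 E4M3) is only lowered on GPUs with compute capability 8.9 or newer (Ada/Hopper), and the supported list in the message -- just fp8e4b15 and fp8e5 -- is what Triton reports on older Ampere-class parts. A minimal guard sketch under that assumption is shown below; the helper name, the test-class name, and the skip message are illustrative, not FBGEMM's actual test plumbing.

    # Sketch only: skip fp8e4nv-dependent tests on GPUs that cannot compile them.
    # Assumption (see note above): Triton lowers fp8e4nv only for compute
    # capability >= 8.9; older parts expose just fp8e4b15/fp8e5, which is
    # exactly the ValueError seen in this log.
    import unittest

    import torch


    def cuda_supports_fp8e4nv() -> bool:
        # No CUDA device at all -> nothing fp8-related can run.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)


    @unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant and friends would live here

With such a guard the whole class would report as skipped on this runner instead of burning through every Hypothesis example and failing each one with the same compilation error.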
[log condensed here: Hypothesis kept retrying the same failure. Each of the examples below ran the identical test body shown above and failed with the same CompilationError from _fbgemm_silu_mul_quant -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100 -- with the compiled=True runs additionally passing through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80:]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)

[one deviation: in the (T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) example the call y_fp8, y_scale = fn() returned, and the test instead failed at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), where ref_fn's call to triton_quantize_fp8_row (fp8_gemm.py:2370) launched _kernel_quantize_fp8_row and hit the same ValueError during autotuner benchmarking (triton/runtime/autotuner.py:186 -> triton/testing.py:117 -> triton/runtime/jit.py:623).]
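[editor's note] For readers reconstructing what the test checks: fn() calls the fused Triton kernel, while ref_fn() computes SiLU(x0) * x1 in fp32 and row-wise FP8-quantizes it via triton_quantize_fp8_row. A dependency-free sketch of that reference math follows, assuming torch.float8_e4m3fn is available (PyTorch >= 2.1); the function name and the E4M3 max constant of 448 are assumptions, and the exact clamping/rounding inside triton_quantize_fp8_row may differ.

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value (assumption noted above)


    def silu_mul_quant_ref(
        x0: torch.Tensor,                         # [T, D] gate input
        x1: torch.Tensor,                         # [T, D] up-projection input
        scale_ub: Optional[torch.Tensor] = None,  # optional [1] fp32 upper bound on row scales
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Row-wise FP8 quantization of SiLU(x0) * x1, mirroring ref_fn in the log."""
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1 in fp32
        row_max = y.abs().amax(dim=1)                            # per-row absolute max
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX          # avoid divide-by-zero
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None], as the test itself does right after fn(), is what would let the fused and reference outputs be compared once the kernels actually compile on the target GPU.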
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2503408Z 2025-05-07T20:33:31.2503883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2503894Z 2025-05-07T20:33:31.2503998Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2504227Z self=, 2025-05-07T20:33:31.2504311Z T=4096, 2025-05-07T20:33:31.2504388Z D=5120, 2025-05-07T20:33:31.2504473Z scale_ub=None, 2025-05-07T20:33:31.2504565Z contiguous=False, 2025-05-07T20:33:31.2504650Z compiled=True, 2025-05-07T20:33:31.2504721Z ) 2025-05-07T20:33:31.2504950Z self = 2025-05-07T20:33:31.2505129Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2505134Z 2025-05-07T20:33:31.2505212Z @given( 2025-05-07T20:33:31.2505338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2505441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2505560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2505680Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2505793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2505872Z ) 2025-05-07T20:33:31.2506127Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2506219Z def test_silu_mul_quant( 2025-05-07T20:33:31.2506303Z self, 2025-05-07T20:33:31.2506385Z T: int, 2025-05-07T20:33:31.2506459Z D: int, 2025-05-07T20:33:31.2506561Z scale_ub: Optional[float], 2025-05-07T20:33:31.2506651Z contiguous: bool, 2025-05-07T20:33:31.2506740Z compiled: bool, 2025-05-07T20:33:31.2506814Z ) -> None: 2025-05-07T20:33:31.2506913Z torch.manual_seed(2025) 2025-05-07T20:33:31.2506990Z 2025-05-07T20:33:31.2507162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2507240Z 2025-05-07T20:33:31.2507335Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2507467Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2507555Z x = x_sign * x_clamp 2025-05-07T20:33:31.2507639Z x0 = x[:, :D] 2025-05-07T20:33:31.2507717Z x1 = x[:, D:] 2025-05-07T20:33:31.2507793Z 2025-05-07T20:33:31.2507882Z if contiguous: 2025-05-07T20:33:31.2507975Z x0 = x0.contiguous() 2025-05-07T20:33:31.2508065Z x1 = x1.contiguous() 2025-05-07T20:33:31.2508193Z 2025-05-07T20:33:31.2508290Z if scale_ub is not None: 2025-05-07T20:33:31.2508399Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2508534Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2508605Z ) 2025-05-07T20:33:31.2508723Z else: 2025-05-07T20:33:31.2508816Z scale_ub_tensor = None 2025-05-07T20:33:31.2508888Z 2025-05-07T20:33:31.2509020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2509107Z op = silu_mul_quant 2025-05-07T20:33:31.2509230Z if compiled: 2025-05-07T20:33:31.2509328Z op = torch.compile(op) 2025-05-07T20:33:31.2509434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2509507Z 2025-05-07T20:33:31.2509599Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2509604Z 2025-05-07T20:33:31.2509696Z moe/activation_test.py:117: 2025-05-07T20:33:31.2509837Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2509934Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2510031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2510422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2510513Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2511073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2511180Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2511554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2511785Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2512140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2512234Z kernel = self.compile( 2025-05-07T20:33:31.2512639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2512816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2512950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2512955Z 2025-05-07T20:33:31.2513162Z self = 2025-05-07T20:33:31.2513975Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2514493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46a91c0>} 2025-05-07T20:33:31.2515289Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2515485Z context = 2025-05-07T20:33:31.2515490Z 2025-05-07T20:33:31.2515656Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2515927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2516037Z module_map=module_map) 2025-05-07T20:33:31.2516196Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2516298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2516370Z E ^ 2025-05-07T20:33:31.2516736Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2516785Z 2025-05-07T20:33:31.2517227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2517232Z 2025-05-07T20:33:31.2517335Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2521478Z self=, 2025-05-07T20:33:31.2521576Z T=4096, 2025-05-07T20:33:31.2521656Z D=5120, 2025-05-07T20:33:31.2521743Z scale_ub=1200.0, 2025-05-07T20:33:31.2521838Z contiguous=False, 2025-05-07T20:33:31.2521921Z compiled=False, 2025-05-07T20:33:31.2522042Z ) 2025-05-07T20:33:31.2522282Z self = 2025-05-07T20:33:31.2522471Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2522476Z 2025-05-07T20:33:31.2522564Z @given( 2025-05-07T20:33:31.2522685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2522792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2522918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2523036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2523151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2523235Z ) 2025-05-07T20:33:31.2523492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2523598Z def test_silu_mul_quant( 2025-05-07T20:33:31.2523719Z self, 2025-05-07T20:33:31.2523798Z T: int, 2025-05-07T20:33:31.2523885Z D: int, 2025-05-07T20:33:31.2523982Z scale_ub: Optional[float], 2025-05-07T20:33:31.2524071Z contiguous: bool, 2025-05-07T20:33:31.2524158Z compiled: bool, 2025-05-07T20:33:31.2524236Z ) -> None: 2025-05-07T20:33:31.2524331Z torch.manual_seed(2025) 2025-05-07T20:33:31.2524405Z 2025-05-07T20:33:31.2524576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2524652Z 2025-05-07T20:33:31.2524746Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2524869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2524964Z x = x_sign * x_clamp 2025-05-07T20:33:31.2525044Z x0 = x[:, :D] 2025-05-07T20:33:31.2525124Z x1 = x[:, D:] 2025-05-07T20:33:31.2525198Z 2025-05-07T20:33:31.2525283Z if contiguous: 2025-05-07T20:33:31.2525374Z x0 = x0.contiguous() 2025-05-07T20:33:31.2525733Z x1 = x1.contiguous() 2025-05-07T20:33:31.2525847Z 2025-05-07T20:33:31.2525960Z if scale_ub is not None: 2025-05-07T20:33:31.2526077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2526211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2526288Z ) 2025-05-07T20:33:31.2526368Z else: 2025-05-07T20:33:31.2526465Z scale_ub_tensor = None 2025-05-07T20:33:31.2526542Z 2025-05-07T20:33:31.2526678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2526768Z op = silu_mul_quant 2025-05-07T20:33:31.2526857Z if compiled: 2025-05-07T20:33:31.2526956Z op = torch.compile(op) 2025-05-07T20:33:31.2527061Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2527141Z 2025-05-07T20:33:31.2527232Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2527237Z 2025-05-07T20:33:31.2527335Z moe/activation_test.py:117: 2025-05-07T20:33:31.2527471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2527574Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2527674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2528210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:31.2528310Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2528784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2529015Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2529434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2529537Z kernel = self.compile( 2025-05-07T20:33:31.2529945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2530128Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2530331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2530336Z 2025-05-07T20:33:31.2530546Z self = 2025-05-07T20:33:31.2531365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2531890Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46aa160>} 2025-05-07T20:33:31.2532747Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2532947Z context = 2025-05-07T20:33:31.2532952Z 2025-05-07T20:33:31.2533125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2533402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2533511Z module_map=module_map) 2025-05-07T20:33:31.2533685Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2533788Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2533868Z E ^ 2025-05-07T20:33:31.2534249Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2534254Z 2025-05-07T20:33:31.2534761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2534766Z 2025-05-07T20:33:31.2534880Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2535115Z self=, 2025-05-07T20:33:31.2535195Z T=4096, 2025-05-07T20:33:31.2535278Z D=5120, 2025-05-07T20:33:31.2535360Z scale_ub=1200.0, 2025-05-07T20:33:31.2535446Z contiguous=False, 2025-05-07T20:33:31.2535534Z compiled=True, 2025-05-07T20:33:31.2535606Z ) 2025-05-07T20:33:31.2535831Z self = 2025-05-07T20:33:31.2536015Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:31.2536019Z 2025-05-07T20:33:31.2536096Z @given( 2025-05-07T20:33:31.2536221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2536325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2536439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2536561Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2536673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2536749Z ) 2025-05-07T20:33:31.2537006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2537099Z def test_silu_mul_quant( 2025-05-07T20:33:31.2537174Z self, 2025-05-07T20:33:31.2537255Z T: int, 2025-05-07T20:33:31.2537331Z D: int, 2025-05-07T20:33:31.2537430Z scale_ub: Optional[float], 2025-05-07T20:33:31.2537574Z contiguous: bool, 2025-05-07T20:33:31.2537661Z compiled: bool, 2025-05-07T20:33:31.2537743Z ) -> None: 2025-05-07T20:33:31.2537840Z torch.manual_seed(2025) 2025-05-07T20:33:31.2537912Z 2025-05-07T20:33:31.2538149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2538225Z 2025-05-07T20:33:31.2538319Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2538450Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2538541Z x = x_sign * x_clamp 2025-05-07T20:33:31.2538659Z x0 = x[:, :D] 2025-05-07T20:33:31.2538744Z x1 = x[:, D:] 2025-05-07T20:33:31.2538816Z 2025-05-07T20:33:31.2538901Z if contiguous: 2025-05-07T20:33:31.2538996Z x0 = x0.contiguous() 2025-05-07T20:33:31.2539085Z x1 = x1.contiguous() 2025-05-07T20:33:31.2539163Z 2025-05-07T20:33:31.2539252Z if scale_ub is not None: 2025-05-07T20:33:31.2539363Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2539500Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2539577Z ) 2025-05-07T20:33:31.2539653Z else: 2025-05-07T20:33:31.2539749Z scale_ub_tensor = None 2025-05-07T20:33:31.2539823Z 2025-05-07T20:33:31.2539954Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2540053Z op = silu_mul_quant 2025-05-07T20:33:31.2540179Z if compiled: 2025-05-07T20:33:31.2540280Z op = torch.compile(op) 2025-05-07T20:33:31.2540392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2540464Z 2025-05-07T20:33:31.2540559Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2540564Z 2025-05-07T20:33:31.2540661Z moe/activation_test.py:117: 2025-05-07T20:33:31.2540795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2540902Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2541005Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2541392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2541492Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2542018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2542122Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2542502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2542737Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2543097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2543192Z kernel = self.compile( 2025-05-07T20:33:31.2543594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2543779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2543912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2543917Z 2025-05-07T20:33:31.2544132Z self = 2025-05-07T20:33:31.2544949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2545471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46ab240>} 2025-05-07T20:33:31.2546270Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2546510Z context = 2025-05-07T20:33:31.2546515Z 2025-05-07T20:33:31.2546726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2546999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2547109Z module_map=module_map) 2025-05-07T20:33:31.2547276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2547413Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2547493Z E ^ 2025-05-07T20:33:31.2547868Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2547873Z 2025-05-07T20:33:31.2548310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2548316Z 2025-05-07T20:33:31.2548427Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2548656Z self=, 2025-05-07T20:33:31.2548739Z T=2048, 2025-05-07T20:33:31.2548815Z D=7168, 2025-05-07T20:33:31.2548900Z scale_ub=1200.0, 2025-05-07T20:33:31.2548992Z contiguous=False, 2025-05-07T20:33:31.2549076Z compiled=False, 2025-05-07T20:33:31.2549149Z ) 2025-05-07T20:33:31.2549417Z self = 2025-05-07T20:33:31.2549601Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2549606Z 2025-05-07T20:33:31.2549682Z @given( 2025-05-07T20:33:31.2549807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2549905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2550025Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2550145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2550260Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2550337Z ) 2025-05-07T20:33:31.2550588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2550686Z def test_silu_mul_quant( 2025-05-07T20:33:31.2550765Z self, 2025-05-07T20:33:31.2550841Z T: int, 2025-05-07T20:33:31.2550918Z D: int, 2025-05-07T20:33:31.2551026Z scale_ub: Optional[float], 2025-05-07T20:33:31.2551118Z contiguous: bool, 2025-05-07T20:33:31.2551202Z compiled: bool, 2025-05-07T20:33:31.2551283Z ) -> None: 2025-05-07T20:33:31.2551380Z torch.manual_seed(2025) 2025-05-07T20:33:31.2551455Z 2025-05-07T20:33:31.2551627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2551701Z 2025-05-07T20:33:31.2551796Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2551922Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2552012Z x = x_sign * x_clamp 2025-05-07T20:33:31.2552097Z x0 = x[:, :D] 2025-05-07T20:33:31.2552175Z x1 = x[:, D:] 2025-05-07T20:33:31.2552247Z 2025-05-07T20:33:31.2552332Z if contiguous: 2025-05-07T20:33:31.2552426Z x0 = x0.contiguous() 2025-05-07T20:33:31.2552518Z x1 = x1.contiguous() 2025-05-07T20:33:31.2552593Z 2025-05-07T20:33:31.2552689Z if scale_ub is not None: 2025-05-07T20:33:31.2552796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2552933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2553007Z ) 2025-05-07T20:33:31.2553089Z else: 2025-05-07T20:33:31.2553183Z scale_ub_tensor = None 2025-05-07T20:33:31.2553256Z 2025-05-07T20:33:31.2553390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2553480Z op = silu_mul_quant 2025-05-07T20:33:31.2553692Z if compiled: 2025-05-07T20:33:31.2553796Z op = torch.compile(op) 2025-05-07T20:33:31.2553900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2553975Z 2025-05-07T20:33:31.2554070Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2554075Z 2025-05-07T20:33:31.2554211Z moe/activation_test.py:117: 2025-05-07T20:33:31.2554350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2554456Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2554556Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2555127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:31.2555229Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2555612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2555852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2556212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2556313Z kernel = self.compile( 2025-05-07T20:33:31.2556723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2556941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2557081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2557088Z 2025-05-07T20:33:31.2557301Z self = 2025-05-07T20:33:31.2558123Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2558644Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a4220>} 2025-05-07T20:33:31.2559446Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2559648Z context = 2025-05-07T20:33:31.2559657Z 2025-05-07T20:33:31.2559828Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2560106Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2560216Z module_map=module_map) 2025-05-07T20:33:31.2560383Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2560492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2560571Z E ^ 2025-05-07T20:33:31.2560945Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2560951Z 2025-05-07T20:33:31.2561395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2561399Z 2025-05-07T20:33:31.2561508Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2561745Z self=, 2025-05-07T20:33:31.2561828Z T=1, 2025-05-07T20:33:31.2561908Z D=7168, 2025-05-07T20:33:31.2561993Z scale_ub=None, 2025-05-07T20:33:31.2562077Z contiguous=True, 2025-05-07T20:33:31.2562159Z compiled=False, 2025-05-07T20:33:31.2562233Z ) 2025-05-07T20:33:31.2562455Z self = 2025-05-07T20:33:31.2562623Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2562672Z 2025-05-07T20:33:31.2562749Z @given( 2025-05-07T20:33:31.2562868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2562970Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2563126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2563244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2563362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2563439Z ) 2025-05-07T20:33:31.2563694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2563828Z def test_silu_mul_quant( 2025-05-07T20:33:31.2563903Z self, 2025-05-07T20:33:31.2563982Z T: int, 2025-05-07T20:33:31.2564060Z D: int, 2025-05-07T20:33:31.2564159Z scale_ub: Optional[float], 2025-05-07T20:33:31.2564250Z contiguous: bool, 2025-05-07T20:33:31.2564335Z compiled: bool, 2025-05-07T20:33:31.2564415Z ) -> None: 2025-05-07T20:33:31.2564512Z torch.manual_seed(2025) 2025-05-07T20:33:31.2564585Z 2025-05-07T20:33:31.2564757Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2564849Z 2025-05-07T20:33:31.2564955Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2565107Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2565199Z x = x_sign * x_clamp 2025-05-07T20:33:31.2565277Z x0 = x[:, :D] 2025-05-07T20:33:31.2565405Z x1 = x[:, D:] 2025-05-07T20:33:31.2565484Z 2025-05-07T20:33:31.2565569Z if contiguous: 2025-05-07T20:33:31.2565665Z x0 = x0.contiguous() 2025-05-07T20:33:31.2565755Z x1 = x1.contiguous() 2025-05-07T20:33:31.2565829Z 2025-05-07T20:33:31.2565926Z if scale_ub is not None: 2025-05-07T20:33:31.2566032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2566167Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2566249Z ) 2025-05-07T20:33:31.2566325Z else: 2025-05-07T20:33:31.2566418Z scale_ub_tensor = None 2025-05-07T20:33:31.2566495Z 2025-05-07T20:33:31.2566625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2566722Z op = silu_mul_quant 2025-05-07T20:33:31.2566806Z if compiled: 2025-05-07T20:33:31.2566905Z op = torch.compile(op) 2025-05-07T20:33:31.2567015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2567089Z 2025-05-07T20:33:31.2567184Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2567189Z 2025-05-07T20:33:31.2567287Z moe/activation_test.py:117: 2025-05-07T20:33:31.2567420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2567524Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2567627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2568153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2568260Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2568640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2568871Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2569237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2569332Z kernel = self.compile( 2025-05-07T20:33:31.2569736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2569915Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2570045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2570050Z 2025-05-07T20:33:31.2570334Z self = 2025-05-07T20:33:31.2571187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2571711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a5120>} 2025-05-07T20:33:31.2572548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2572744Z context = 2025-05-07T20:33:31.2572748Z 2025-05-07T20:33:31.2572923Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2573199Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2573315Z module_map=module_map) 2025-05-07T20:33:31.2573482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2573590Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2573674Z E ^ 2025-05-07T20:33:31.2574092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2574098Z 2025-05-07T20:33:31.2574603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2574612Z 2025-05-07T20:33:31.2574719Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2574977Z self=, 2025-05-07T20:33:31.2575072Z T=16384, 2025-05-07T20:33:31.2575168Z D=7168, 2025-05-07T20:33:31.2575259Z scale_ub=1200.0, 2025-05-07T20:33:31.2575350Z contiguous=False, 2025-05-07T20:33:31.2575436Z compiled=True, 2025-05-07T20:33:31.2575515Z ) 2025-05-07T20:33:31.2575744Z self = 2025-05-07T20:33:31.2575932Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:31.2575937Z 2025-05-07T20:33:31.2576021Z @given( 2025-05-07T20:33:31.2576141Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2576239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2576360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2576477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2576589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2576671Z ) 2025-05-07T20:33:31.2576923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2577017Z def test_silu_mul_quant( 2025-05-07T20:33:31.2577097Z self, 2025-05-07T20:33:31.2577176Z T: int, 2025-05-07T20:33:31.2577255Z D: int, 2025-05-07T20:33:31.2577359Z scale_ub: Optional[float], 2025-05-07T20:33:31.2577451Z contiguous: bool, 2025-05-07T20:33:31.2577545Z compiled: bool, 2025-05-07T20:33:31.2577623Z ) -> None: 2025-05-07T20:33:31.2577719Z torch.manual_seed(2025) 2025-05-07T20:33:31.2577795Z 2025-05-07T20:33:31.2577969Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2578046Z 2025-05-07T20:33:31.2578147Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2578271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2578362Z x = x_sign * x_clamp 2025-05-07T20:33:31.2578442Z x0 = x[:, :D] 2025-05-07T20:33:31.2578522Z x1 = x[:, D:] 2025-05-07T20:33:31.2578594Z 2025-05-07T20:33:31.2578683Z if contiguous: 2025-05-07T20:33:31.2578826Z x0 = x0.contiguous() 2025-05-07T20:33:31.2578919Z x1 = x1.contiguous() 2025-05-07T20:33:31.2578992Z 2025-05-07T20:33:31.2579082Z if scale_ub is not None: 2025-05-07T20:33:31.2579190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2579364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2579440Z ) 2025-05-07T20:33:31.2579516Z else: 2025-05-07T20:33:31.2579616Z scale_ub_tensor = None 2025-05-07T20:33:31.2579690Z 2025-05-07T20:33:31.2579829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2579964Z op = silu_mul_quant 2025-05-07T20:33:31.2580056Z if compiled: 2025-05-07T20:33:31.2580157Z op = torch.compile(op) 2025-05-07T20:33:31.2580267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2580342Z 2025-05-07T20:33:31.2580434Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2580440Z 2025-05-07T20:33:31.2580540Z moe/activation_test.py:117: 2025-05-07T20:33:31.2580682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2580786Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2580890Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2581335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2581430Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2582075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2582179Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2582604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2582862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2583268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2583366Z kernel = self.compile( 2025-05-07T20:33:31.2583823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2584019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2584159Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2584166Z 2025-05-07T20:33:31.2584396Z self = 2025-05-07T20:33:31.2585366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2585982Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a6520>} 2025-05-07T20:33:31.2586906Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2587128Z context = 2025-05-07T20:33:31.2587133Z 2025-05-07T20:33:31.2587315Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2587625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2587737Z module_map=module_map) 2025-05-07T20:33:31.2587912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2588016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2588092Z E ^ 2025-05-07T20:33:31.2588515Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2588600Z 2025-05-07T20:33:31.2589040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2589045Z 2025-05-07T20:33:31.2589193Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2589430Z self=, 2025-05-07T20:33:31.2589512Z T=1, 2025-05-07T20:33:31.2589596Z D=7168, 2025-05-07T20:33:31.2589684Z scale_ub=None, 2025-05-07T20:33:31.2590371Z contiguous=False, 2025-05-07T20:33:31.2590459Z compiled=False, 2025-05-07T20:33:31.2590541Z ) 2025-05-07T20:33:31.2590770Z self = 2025-05-07T20:33:31.2590943Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2590948Z 2025-05-07T20:33:31.2591032Z @given( 2025-05-07T20:33:31.2591155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2591259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2591377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2591494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2591619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2591699Z ) 2025-05-07T20:33:31.2591952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2592097Z def test_silu_mul_quant( 2025-05-07T20:33:31.2592177Z self, 2025-05-07T20:33:31.2592257Z T: int, 2025-05-07T20:33:31.2592331Z D: int, 2025-05-07T20:33:31.2592428Z scale_ub: Optional[float], 2025-05-07T20:33:31.2592520Z contiguous: bool, 2025-05-07T20:33:31.2592604Z compiled: bool, 2025-05-07T20:33:31.2592680Z ) -> None: 2025-05-07T20:33:31.2592780Z torch.manual_seed(2025) 2025-05-07T20:33:31.2592856Z 2025-05-07T20:33:31.2593028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2593106Z 2025-05-07T20:33:31.2593199Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2593322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2593417Z x = x_sign * x_clamp 2025-05-07T20:33:31.2593498Z x0 = x[:, :D] 2025-05-07T20:33:31.2593581Z x1 = x[:, D:] 2025-05-07T20:33:31.2593653Z 2025-05-07T20:33:31.2593739Z if contiguous: 2025-05-07T20:33:31.2593834Z x0 = x0.contiguous() 2025-05-07T20:33:31.2593926Z x1 = x1.contiguous() 2025-05-07T20:33:31.2593999Z 2025-05-07T20:33:31.2594092Z if scale_ub is not None: 2025-05-07T20:33:31.2594197Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2594329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2594410Z ) 2025-05-07T20:33:31.2594485Z else: 2025-05-07T20:33:31.2594580Z scale_ub_tensor = None 2025-05-07T20:33:31.2594655Z 2025-05-07T20:33:31.2594786Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2594878Z op = silu_mul_quant 2025-05-07T20:33:31.2594966Z if compiled: 2025-05-07T20:33:31.2595067Z op = torch.compile(op) 2025-05-07T20:33:31.2595174Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2595246Z 2025-05-07T20:33:31.2595335Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2595342Z 2025-05-07T20:33:31.2595441Z moe/activation_test.py:117: 2025-05-07T20:33:31.2595575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2595673Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2595775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2596300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2596452Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2596830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2597056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2597459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2597555Z kernel = self.compile( 2025-05-07T20:33:31.2597960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2598178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2598308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2598313Z 2025-05-07T20:33:31.2598524Z self = 2025-05-07T20:33:31.2599337Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2599858Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a7100>} 2025-05-07T20:33:31.2600714Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2600911Z context = 2025-05-07T20:33:31.2600915Z 2025-05-07T20:33:31.2601089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2601358Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2601471Z module_map=module_map) 2025-05-07T20:33:31.2601633Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2601732Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2601811Z E ^ 2025-05-07T20:33:31.2602185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2602190Z 2025-05-07T20:33:31.2602628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2602635Z 2025-05-07T20:33:31.2602742Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2602969Z self=, 2025-05-07T20:33:31.2603049Z T=2048, 2025-05-07T20:33:31.2603126Z D=7168, 2025-05-07T20:33:31.2603210Z scale_ub=None, 2025-05-07T20:33:31.2603298Z contiguous=False, 2025-05-07T20:33:31.2603383Z compiled=True, 2025-05-07T20:33:31.2603456Z ) 2025-05-07T20:33:31.2603682Z self = 2025-05-07T20:33:31.2603858Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2603863Z 2025-05-07T20:33:31.2603941Z @given( 2025-05-07T20:33:31.2604067Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2604164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2604283Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2604396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2604507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2604582Z ) 2025-05-07T20:33:31.2604831Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2604924Z def test_silu_mul_quant( 2025-05-07T20:33:31.2604999Z self, 2025-05-07T20:33:31.2605074Z T: int, 2025-05-07T20:33:31.2605198Z D: int, 2025-05-07T20:33:31.2605297Z scale_ub: Optional[float], 2025-05-07T20:33:31.2605383Z contiguous: bool, 2025-05-07T20:33:31.2605466Z compiled: bool, 2025-05-07T20:33:31.2605545Z ) -> None: 2025-05-07T20:33:31.2605636Z torch.manual_seed(2025) 2025-05-07T20:33:31.2605750Z 2025-05-07T20:33:31.2605921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2605994Z 2025-05-07T20:33:31.2606092Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2606212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2606335Z x = x_sign * x_clamp 2025-05-07T20:33:31.2606417Z x0 = x[:, :D] 2025-05-07T20:33:31.2606495Z x1 = x[:, D:] 2025-05-07T20:33:31.2606564Z 2025-05-07T20:33:31.2606651Z if contiguous: 2025-05-07T20:33:31.2606739Z x0 = x0.contiguous() 2025-05-07T20:33:31.2606826Z x1 = x1.contiguous() 2025-05-07T20:33:31.2606899Z 2025-05-07T20:33:31.2606985Z if scale_ub is not None: 2025-05-07T20:33:31.2607090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2607219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2607295Z ) 2025-05-07T20:33:31.2607372Z else: 2025-05-07T20:33:31.2607466Z scale_ub_tensor = None 2025-05-07T20:33:31.2607534Z 2025-05-07T20:33:31.2607663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2607793Z op = silu_mul_quant 2025-05-07T20:33:31.2607877Z if compiled: 2025-05-07T20:33:31.2607979Z op = torch.compile(op) 2025-05-07T20:33:31.2608082Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2608153Z 2025-05-07T20:33:31.2608243Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2608248Z 2025-05-07T20:33:31.2608339Z moe/activation_test.py:117: 2025-05-07T20:33:31.2608471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2608573Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2608671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2609057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2609150Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2609673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2609770Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2610147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2610376Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2610729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2610820Z kernel = self.compile( 2025-05-07T20:33:31.2611227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2611401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2611535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2611542Z 2025-05-07T20:33:31.2611747Z self = 2025-05-07T20:33:31.2612560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2613076Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4704720>} 2025-05-07T20:33:31.2613865Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2614107Z context = 2025-05-07T20:33:31.2614112Z 2025-05-07T20:33:31.2614316Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2614648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2614759Z module_map=module_map) 2025-05-07T20:33:31.2614964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2615066Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2615140Z E ^ 2025-05-07T20:33:31.2615510Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2615515Z 2025-05-07T20:33:31.2615956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2615960Z 2025-05-07T20:33:31.2616062Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2616290Z self=, 2025-05-07T20:33:31.2616373Z T=4096, 2025-05-07T20:33:31.2616451Z D=7168, 2025-05-07T20:33:31.2616533Z scale_ub=None, 2025-05-07T20:33:31.2616617Z contiguous=False, 2025-05-07T20:33:31.2616737Z compiled=True, 2025-05-07T20:33:31.2616813Z ) 2025-05-07T20:33:31.2617043Z self = 2025-05-07T20:33:31.2617222Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2617226Z 2025-05-07T20:33:31.2617308Z @given( 2025-05-07T20:33:31.2617427Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2617527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2617649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2617767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2617883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2617955Z ) 2025-05-07T20:33:31.2618212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2618310Z def test_silu_mul_quant( 2025-05-07T20:33:31.2618385Z self, 2025-05-07T20:33:31.2618466Z T: int, 2025-05-07T20:33:31.2618554Z D: int, 2025-05-07T20:33:31.2618658Z scale_ub: Optional[float], 2025-05-07T20:33:31.2618747Z contiguous: bool, 2025-05-07T20:33:31.2618836Z compiled: bool, 2025-05-07T20:33:31.2618909Z ) -> None: 2025-05-07T20:33:31.2619000Z torch.manual_seed(2025) 2025-05-07T20:33:31.2619074Z 2025-05-07T20:33:31.2619246Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2619324Z 2025-05-07T20:33:31.2619412Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2619534Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2619622Z x = x_sign * x_clamp 2025-05-07T20:33:31.2619697Z x0 = x[:, :D] 2025-05-07T20:33:31.2619774Z x1 = x[:, D:] 2025-05-07T20:33:31.2619850Z 2025-05-07T20:33:31.2619929Z if contiguous: 2025-05-07T20:33:31.2620017Z x0 = x0.contiguous() 2025-05-07T20:33:31.2620106Z x1 = x1.contiguous() 2025-05-07T20:33:31.2620179Z 2025-05-07T20:33:31.2620268Z if scale_ub is not None: 2025-05-07T20:33:31.2620376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2620508Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2620588Z ) 2025-05-07T20:33:31.2620665Z else: 2025-05-07T20:33:31.2620754Z scale_ub_tensor = None 2025-05-07T20:33:31.2620828Z 2025-05-07T20:33:31.2620956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2621093Z op = silu_mul_quant 2025-05-07T20:33:31.2621178Z if compiled: 2025-05-07T20:33:31.2621276Z op = torch.compile(op) 2025-05-07T20:33:31.2621382Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2621454Z 2025-05-07T20:33:31.2621582Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2621587Z 2025-05-07T20:33:31.2621682Z moe/activation_test.py:117: 2025-05-07T20:33:31.2621819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2621955Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2622055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2622440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2622530Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2623056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2623157Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2623531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2623765Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2624123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2624252Z kernel = self.compile( 2025-05-07T20:33:31.2624659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2624841Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2624976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2624981Z 2025-05-07T20:33:31.2625190Z self = 2025-05-07T20:33:31.2626292Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2626816Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4705440>} 2025-05-07T20:33:31.2627612Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2627817Z context = 2025-05-07T20:33:31.2627822Z 2025-05-07T20:33:31.2627991Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2628266Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2628376Z module_map=module_map) 2025-05-07T20:33:31.2628539Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2628641Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2628721Z E ^ 2025-05-07T20:33:31.2629098Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2629103Z 2025-05-07T20:33:31.2629541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2629547Z 2025-05-07T20:33:31.2629650Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2629887Z self=, 2025-05-07T20:33:31.2629965Z T=16384, 2025-05-07T20:33:31.2630045Z D=5120, 2025-05-07T20:33:31.2630132Z scale_ub=1200.0, 2025-05-07T20:33:31.2630311Z contiguous=False, 2025-05-07T20:33:31.2630397Z compiled=False, 2025-05-07T20:33:31.2630472Z ) 2025-05-07T20:33:31.2630696Z self = 2025-05-07T20:33:31.2630885Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2630949Z 2025-05-07T20:33:31.2631028Z @given( 2025-05-07T20:33:31.2631146Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2631254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2631368Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2631571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2631687Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2631762Z ) 2025-05-07T20:33:31.2632020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2632112Z def test_silu_mul_quant( 2025-05-07T20:33:31.2632191Z self, 2025-05-07T20:33:31.2632272Z T: int, 2025-05-07T20:33:31.2632349Z D: int, 2025-05-07T20:33:31.2632445Z scale_ub: Optional[float], 2025-05-07T20:33:31.2632538Z contiguous: bool, 2025-05-07T20:33:31.2632625Z compiled: bool, 2025-05-07T20:33:31.2632703Z ) -> None: 2025-05-07T20:33:31.2632805Z torch.manual_seed(2025) 2025-05-07T20:33:31.2632877Z 2025-05-07T20:33:31.2633045Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2633179Z 2025-05-07T20:33:31.2633271Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2633397Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2633483Z x = x_sign * x_clamp 2025-05-07T20:33:31.2633558Z x0 = x[:, :D] 2025-05-07T20:33:31.2633638Z x1 = x[:, D:] 2025-05-07T20:33:31.2633711Z 2025-05-07T20:33:31.2633791Z if contiguous: 2025-05-07T20:33:31.2633886Z x0 = x0.contiguous() 2025-05-07T20:33:31.2633975Z x1 = x1.contiguous() 2025-05-07T20:33:31.2634044Z 2025-05-07T20:33:31.2634135Z if scale_ub is not None: 2025-05-07T20:33:31.2634235Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2634366Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2634446Z ) 2025-05-07T20:33:31.2634522Z else: 2025-05-07T20:33:31.2634614Z scale_ub_tensor = None 2025-05-07T20:33:31.2634685Z 2025-05-07T20:33:31.2634815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2634907Z op = silu_mul_quant 2025-05-07T20:33:31.2634987Z if compiled: 2025-05-07T20:33:31.2635084Z op = torch.compile(op) 2025-05-07T20:33:31.2635193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2635263Z 2025-05-07T20:33:31.2635354Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2635359Z 2025-05-07T20:33:31.2635455Z moe/activation_test.py:117: 2025-05-07T20:33:31.2635587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2635686Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2635784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2636308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
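Note on the root cause: fp8e4nv is Triton's name for the e4m3 float8 format that torch.float8_e4m3fn lowers to, and Triton's NVIDIA backend only compiles it for GPUs of compute capability 8.9 or newer; older architectures get only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal capability probe is sketched below under that assumption; the helper names are illustrative, not fbgemm_gpu API.

# Sketch only: probe for fp8 e4m3 support before dispatching to a Triton
# kernel that casts to fp8e4nv. Helper names here are hypothetical.
import torch

def fp8_e4m3_supported() -> bool:
    # Assumption: Triton's NVIDIA backend lowers fp8e4nv only on
    # compute capability >= 8.9 (Ada/Hopper and newer).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

def pick_fp8_dtype() -> torch.dtype:
    # Illustrative fallback: e5m2 (Triton's fp8e5) compiles on older
    # parts, trading one mantissa bit for exponent range.
    return torch.float8_e4m3fn if fp8_e4m3_supported() else torch.float8_e5m2

With a probe like this, a wrapper around silu_mul_quant could select a compilable dtype, or raise a clear error, before the failure surfaces inside Triton's make_ir.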
Hypothesis continued with the remaining examples; every one failed identically, with the same source listing, the same call chain (moe/activation_test.py:117 -> moe/activation_test.py:115 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 -> triton/runtime/jit.py -> triton/compiler/compiler.py:100), and the same CompilationError. The only variation: for examples with compiled=False the torch/_dynamo/eval_frame.py:678 frame is absent, since torch.compile is not applied. The retried examples, in order:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)

Each ended with:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2815707Z 2025-05-07T20:33:31.2816147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2816198Z 2025-05-07T20:33:31.2816310Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2816541Z self=, 2025-05-07T20:33:31.2816621Z T=16384, 2025-05-07T20:33:31.2816706Z D=5120, 2025-05-07T20:33:31.2816829Z scale_ub=None, 2025-05-07T20:33:31.2816918Z contiguous=False, 2025-05-07T20:33:31.2817008Z compiled=False, 2025-05-07T20:33:31.2817079Z ) 2025-05-07T20:33:31.2817308Z self = 2025-05-07T20:33:31.2817533Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2817537Z 2025-05-07T20:33:31.2817619Z @given( 2025-05-07T20:33:31.2817745Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2817847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2817964Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2818089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2818206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2818279Z ) 2025-05-07T20:33:31.2818536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2818634Z def test_silu_mul_quant( 2025-05-07T20:33:31.2818715Z self, 2025-05-07T20:33:31.2818793Z T: int, 2025-05-07T20:33:31.2818870Z D: int, 2025-05-07T20:33:31.2819012Z scale_ub: Optional[float], 2025-05-07T20:33:31.2819107Z contiguous: bool, 2025-05-07T20:33:31.2819199Z compiled: bool, 2025-05-07T20:33:31.2819284Z ) -> None: 2025-05-07T20:33:31.2819377Z torch.manual_seed(2025) 2025-05-07T20:33:31.2819449Z 2025-05-07T20:33:31.2819626Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2819701Z 2025-05-07T20:33:31.2819796Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2819923Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2821879Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
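The CompilationError repeated above is Triton rejecting the fp8e4nv (e4m3fn) dtype: hardware e4m3 conversion support first ships with compute capability 8.9 (Ada), and on older parts Triton only exposes fp8e4b15 and fp8e5, exactly as the message says; the 22.07 GiB capacity in the OOM reports is consistent with an sm_86-class card. A minimal guard sketch, assuming a unittest-style suite and that skipping is acceptable (the helper and class names are illustrative, not from FBGEMM):

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3fn) needs hardware support introduced with compute
    # capability 8.9; earlier GPUs only get fp8e4b15 / fp8e5 in Triton.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv (e4m3) support")
class Fp8ActivationTests(unittest.TestCase):
    ...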
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2821890Z 2025-05-07T20:33:31.2822008Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.2822013Z 2025-05-07T20:33:31.2822117Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2822352Z self=, 2025-05-07T20:33:31.2822433Z T=4096, 2025-05-07T20:33:31.2822514Z D=7168, 2025-05-07T20:33:31.2822605Z scale_ub=1200.0, 2025-05-07T20:33:31.2822697Z contiguous=True, 2025-05-07T20:33:31.2822786Z compiled=True, 2025-05-07T20:33:31.2822868Z ) 2025-05-07T20:33:31.2823092Z self = 2025-05-07T20:33:31.2823271Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.2823278Z 2025-05-07T20:33:31.2823360Z @given( 2025-05-07T20:33:31.2823484Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2823590Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2823709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2823829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2823950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2824025Z ) 2025-05-07T20:33:31.2824281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2824433Z def test_silu_mul_quant( 2025-05-07T20:33:31.2824513Z self, 2025-05-07T20:33:31.2824594Z T: int, 2025-05-07T20:33:31.2824674Z D: int, 2025-05-07T20:33:31.2824775Z scale_ub: Optional[float], 2025-05-07T20:33:31.2824870Z contiguous: bool, 2025-05-07T20:33:31.2824997Z compiled: bool, 2025-05-07T20:33:31.2825084Z ) -> None: 2025-05-07T20:33:31.2825189Z torch.manual_seed(2025) 2025-05-07T20:33:31.2825262Z 2025-05-07T20:33:31.2825672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2825980Z 2025-05-07T20:33:31.2826077Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2826206Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2828146Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2828155Z 2025-05-07T20:33:31.2828276Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.2828280Z 2025-05-07T20:33:31.2828453Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2828684Z self=, 2025-05-07T20:33:31.2828774Z T=16384, 2025-05-07T20:33:31.2828857Z D=7168, 2025-05-07T20:33:31.2828945Z scale_ub=None, 2025-05-07T20:33:31.2829036Z contiguous=False, 2025-05-07T20:33:31.2829126Z compiled=False, 2025-05-07T20:33:31.2829201Z ) 2025-05-07T20:33:31.2829433Z self = 2025-05-07T20:33:31.2829617Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2829622Z 2025-05-07T20:33:31.2829703Z @given( 2025-05-07T20:33:31.2829833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2829937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2830057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2830181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2830301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2830385Z ) 2025-05-07T20:33:31.2830641Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2830735Z def test_silu_mul_quant( 2025-05-07T20:33:31.2830816Z self, 2025-05-07T20:33:31.2830894Z T: int, 2025-05-07T20:33:31.2830973Z D: int, 2025-05-07T20:33:31.2831077Z scale_ub: Optional[float], 2025-05-07T20:33:31.2831167Z contiguous: bool, 2025-05-07T20:33:31.2831259Z compiled: bool, 2025-05-07T20:33:31.2831344Z ) -> None: 2025-05-07T20:33:31.2831437Z torch.manual_seed(2025) 2025-05-07T20:33:31.2831512Z 2025-05-07T20:33:31.2831680Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2833610Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2833625Z 2025-05-07T20:33:31.2833740Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2833814Z 2025-05-07T20:33:31.2833915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2834145Z self=, 2025-05-07T20:33:31.2834222Z T=2048, 2025-05-07T20:33:31.2834300Z D=7168, 2025-05-07T20:33:31.2834383Z scale_ub=1200.0, 2025-05-07T20:33:31.2834524Z contiguous=True, 2025-05-07T20:33:31.2834611Z compiled=True, 2025-05-07T20:33:31.2834688Z ) 2025-05-07T20:33:31.2834909Z self = 2025-05-07T20:33:31.2835130Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.2835135Z 2025-05-07T20:33:31.2835208Z @given( 2025-05-07T20:33:31.2835322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2835424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2835535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2835648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2835764Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2835837Z ) 2025-05-07T20:33:31.2836086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2836183Z def test_silu_mul_quant( 2025-05-07T20:33:31.2836263Z self, 2025-05-07T20:33:31.2836340Z T: int, 2025-05-07T20:33:31.2836417Z D: int, 2025-05-07T20:33:31.2836512Z scale_ub: Optional[float], 2025-05-07T20:33:31.2836641Z contiguous: bool, 2025-05-07T20:33:31.2836729Z compiled: bool, 2025-05-07T20:33:31.2836804Z ) -> None: 2025-05-07T20:33:31.2836898Z torch.manual_seed(2025) 2025-05-07T20:33:31.2836969Z 2025-05-07T20:33:31.2837135Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2837212Z 2025-05-07T20:33:31.2837304Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2837424Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2839342Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2839351Z 2025-05-07T20:33:31.2839470Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.2839474Z 2025-05-07T20:33:31.2839575Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2839803Z self=, 2025-05-07T20:33:31.2839884Z T=2048, 2025-05-07T20:33:31.2839960Z D=7168, 2025-05-07T20:33:31.2840044Z scale_ub=None, 2025-05-07T20:33:31.2840133Z contiguous=True, 2025-05-07T20:33:31.2840216Z compiled=False, 2025-05-07T20:33:31.2840293Z ) 2025-05-07T20:33:31.2840520Z self = 2025-05-07T20:33:31.2840696Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2840700Z 2025-05-07T20:33:31.2840776Z @given( 2025-05-07T20:33:31.2840893Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2840990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2841109Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2841225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2841337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2841410Z ) 2025-05-07T20:33:31.2841665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2841756Z def test_silu_mul_quant( 2025-05-07T20:33:31.2841876Z self, 2025-05-07T20:33:31.2841955Z T: int, 2025-05-07T20:33:31.2842036Z D: int, 2025-05-07T20:33:31.2842132Z scale_ub: Optional[float], 2025-05-07T20:33:31.2842219Z contiguous: bool, 2025-05-07T20:33:31.2842306Z compiled: bool, 2025-05-07T20:33:31.2842448Z ) -> None: 2025-05-07T20:33:31.2842540Z torch.manual_seed(2025) 2025-05-07T20:33:31.2842616Z 2025-05-07T20:33:31.2842786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2842859Z 2025-05-07T20:33:31.2843001Z > x_sign = torch.sign(x) 2025-05-07T20:33:31.2844913Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
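Across the OutOfMemoryError reports above, memory "allocated by PyTorch" hovers between 21.50 and 21.73 GiB while each example only needs tens to hundreds of MiB, so tensors from earlier failed examples are evidently still held when the next draw runs. Two mitigations follow directly from the message: the allocator option it suggests, plus releasing cached blocks between examples. A sketch, assuming it runs before CUDA is first initialized (e.g. at the top of a conftest.py); the helper name is illustrative:

import gc
import os

# Must be set before the process first touches CUDA for the allocator to see it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

def release_cuda_memory() -> None:
    # Drop dead Python references, then return cached blocks to the driver
    # so the next Hypothesis example starts from a clean pool.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()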
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2844922Z 2025-05-07T20:33:31.2845040Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:31.2845045Z 2025-05-07T20:33:31.2845145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2845415Z self=, 2025-05-07T20:33:31.2845489Z T=1, 2025-05-07T20:33:31.2845564Z D=7168, 2025-05-07T20:33:31.2845648Z scale_ub=1200.0, 2025-05-07T20:33:31.2845732Z contiguous=True, 2025-05-07T20:33:31.2845815Z compiled=False, 2025-05-07T20:33:31.2845892Z ) 2025-05-07T20:33:31.2846111Z self = 2025-05-07T20:33:31.2846276Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2846283Z 2025-05-07T20:33:31.2846364Z @given( 2025-05-07T20:33:31.2846482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2846580Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2846691Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2846807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2846921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2846992Z ) 2025-05-07T20:33:31.2847243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2847341Z def test_silu_mul_quant( 2025-05-07T20:33:31.2847419Z self, 2025-05-07T20:33:31.2847494Z T: int, 2025-05-07T20:33:31.2847569Z D: int, 2025-05-07T20:33:31.2847666Z scale_ub: Optional[float], 2025-05-07T20:33:31.2847756Z contiguous: bool, 2025-05-07T20:33:31.2847837Z compiled: bool, 2025-05-07T20:33:31.2847914Z ) -> None: 2025-05-07T20:33:31.2848013Z torch.manual_seed(2025) 2025-05-07T20:33:31.2848087Z 2025-05-07T20:33:31.2848255Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2848333Z 2025-05-07T20:33:31.2848424Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2848549Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2848640Z x = x_sign * x_clamp 2025-05-07T20:33:31.2848715Z x0 = x[:, :D] 2025-05-07T20:33:31.2848792Z x1 = x[:, D:] 2025-05-07T20:33:31.2848872Z 2025-05-07T20:33:31.2848957Z if contiguous: 2025-05-07T20:33:31.2849049Z x0 = x0.contiguous() 2025-05-07T20:33:31.2849139Z x1 = x1.contiguous() 2025-05-07T20:33:31.2849209Z 2025-05-07T20:33:31.2849300Z if scale_ub is not None: 2025-05-07T20:33:31.2849403Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2849535Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2849665Z ) 2025-05-07T20:33:31.2849737Z else: 2025-05-07T20:33:31.2849831Z scale_ub_tensor = None 2025-05-07T20:33:31.2849904Z 2025-05-07T20:33:31.2850033Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2850121Z op = silu_mul_quant 2025-05-07T20:33:31.2850250Z if compiled: 2025-05-07T20:33:31.2850350Z op = torch.compile(op) 2025-05-07T20:33:31.2850452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2850532Z 2025-05-07T20:33:31.2850623Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2850666Z 2025-05-07T20:33:31.2850771Z moe/activation_test.py:117: 2025-05-07T20:33:31.2850906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2851011Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2851116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2851641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2851739Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2852123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2852353Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2852714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2852847Z kernel = self.compile( 2025-05-07T20:33:31.2853255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2853437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2853567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2853571Z 2025-05-07T20:33:31.2853789Z self = 2025-05-07T20:33:31.2854670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2855191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4296520>} 2025-05-07T20:33:31.2855994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2856192Z context = 2025-05-07T20:33:31.2856197Z 2025-05-07T20:33:31.2856371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2856647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2856757Z module_map=module_map) 2025-05-07T20:33:31.2856927Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2857030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2857112Z E ^ 2025-05-07T20:33:31.2857485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2857492Z 2025-05-07T20:33:31.2857935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2857942Z 2025-05-07T20:33:31.2858049Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2858283Z self=, 2025-05-07T20:33:31.2858363Z T=128, 2025-05-07T20:33:31.2858445Z D=5120, 2025-05-07T20:33:31.2858535Z scale_ub=None, 2025-05-07T20:33:31.2858674Z contiguous=True, 2025-05-07T20:33:31.2858759Z compiled=False, 2025-05-07T20:33:31.2858835Z ) 2025-05-07T20:33:31.2859065Z self = 2025-05-07T20:33:31.2859242Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2859285Z 2025-05-07T20:33:31.2859363Z @given( 2025-05-07T20:33:31.2859487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2859595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2859710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2859877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2859994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2860070Z ) 2025-05-07T20:33:31.2860323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2860419Z def test_silu_mul_quant( 2025-05-07T20:33:31.2860508Z self, 2025-05-07T20:33:31.2860587Z T: int, 2025-05-07T20:33:31.2860666Z D: int, 2025-05-07T20:33:31.2860766Z scale_ub: Optional[float], 2025-05-07T20:33:31.2860856Z contiguous: bool, 2025-05-07T20:33:31.2860941Z compiled: bool, 2025-05-07T20:33:31.2861023Z ) -> None: 2025-05-07T20:33:31.2861123Z torch.manual_seed(2025) 2025-05-07T20:33:31.2861197Z 2025-05-07T20:33:31.2861372Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2861489Z 2025-05-07T20:33:31.2861588Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2861718Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2861806Z x = x_sign * x_clamp 2025-05-07T20:33:31.2861896Z x0 = x[:, :D] 2025-05-07T20:33:31.2861975Z x1 = x[:, D:] 2025-05-07T20:33:31.2862049Z 2025-05-07T20:33:31.2862136Z if contiguous: 2025-05-07T20:33:31.2862230Z x0 = x0.contiguous() 2025-05-07T20:33:31.2862327Z x1 = x1.contiguous() 2025-05-07T20:33:31.2862407Z 2025-05-07T20:33:31.2862500Z if scale_ub is not None: 2025-05-07T20:33:31.2862610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2862751Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2862829Z ) 2025-05-07T20:33:31.2862910Z else: 2025-05-07T20:33:31.2863011Z scale_ub_tensor = None 2025-05-07T20:33:31.2863085Z 2025-05-07T20:33:31.2863226Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2863320Z op = silu_mul_quant 2025-05-07T20:33:31.2863407Z if compiled: 2025-05-07T20:33:31.2863514Z op = torch.compile(op) 2025-05-07T20:33:31.2863619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2863691Z 2025-05-07T20:33:31.2863788Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2863792Z 2025-05-07T20:33:31.2863891Z moe/activation_test.py:117: 2025-05-07T20:33:31.2864029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2864134Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2864238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2864771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2864875Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2865258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2865498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2865857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2865956Z kernel = self.compile( 2025-05-07T20:33:31.2866362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2866587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2866723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2866728Z 2025-05-07T20:33:31.2866978Z self = 2025-05-07T20:33:31.2867796Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2868359Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4297420>} 2025-05-07T20:33:31.2869154Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2869361Z context = 2025-05-07T20:33:31.2869365Z 2025-05-07T20:33:31.2869536Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2869815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2869922Z module_map=module_map) 2025-05-07T20:33:31.2870126Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2870238Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2870317Z E ^ 2025-05-07T20:33:31.2870689Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2870694Z 2025-05-07T20:33:31.2871136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2871144Z 2025-05-07T20:33:31.2871250Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2871483Z self=, 2025-05-07T20:33:31.2871563Z T=128, 2025-05-07T20:33:31.2871641Z D=7168, 2025-05-07T20:33:31.2871728Z scale_ub=None, 2025-05-07T20:33:31.2871818Z contiguous=True, 2025-05-07T20:33:31.2871901Z compiled=False, 2025-05-07T20:33:31.2871984Z ) 2025-05-07T20:33:31.2872213Z self = 2025-05-07T20:33:31.2872392Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2872401Z 2025-05-07T20:33:31.2872482Z @given( 2025-05-07T20:33:31.2872602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2872708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2872823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2872944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2873066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2873141Z ) 2025-05-07T20:33:31.2873395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2873494Z def test_silu_mul_quant( 2025-05-07T20:33:31.2873572Z self, 2025-05-07T20:33:31.2873654Z T: int, 2025-05-07T20:33:31.2873731Z D: int, 2025-05-07T20:33:31.2873833Z scale_ub: Optional[float], 2025-05-07T20:33:31.2873930Z contiguous: bool, 2025-05-07T20:33:31.2874021Z compiled: bool, 2025-05-07T20:33:31.2874099Z ) -> None: 2025-05-07T20:33:31.2874198Z torch.manual_seed(2025) 2025-05-07T20:33:31.2874277Z 2025-05-07T20:33:31.2874474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2874566Z 2025-05-07T20:33:31.2874673Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2874798Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2874967Z x = x_sign * x_clamp 2025-05-07T20:33:31.2875050Z x0 = x[:, :D] 2025-05-07T20:33:31.2875131Z x1 = x[:, D:] 2025-05-07T20:33:31.2875209Z 2025-05-07T20:33:31.2875295Z if contiguous: 2025-05-07T20:33:31.2875393Z x0 = x0.contiguous() 2025-05-07T20:33:31.2875525Z x1 = x1.contiguous() 2025-05-07T20:33:31.2875599Z 2025-05-07T20:33:31.2875696Z if scale_ub is not None: 2025-05-07T20:33:31.2875806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2875944Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2876066Z ) 2025-05-07T20:33:31.2876146Z else: 2025-05-07T20:33:31.2876241Z scale_ub_tensor = None 2025-05-07T20:33:31.2876318Z 2025-05-07T20:33:31.2876450Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2876542Z op = silu_mul_quant 2025-05-07T20:33:31.2876633Z if compiled: 2025-05-07T20:33:31.2876737Z op = torch.compile(op) 2025-05-07T20:33:31.2876850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2876923Z 2025-05-07T20:33:31.2877015Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2877020Z 2025-05-07T20:33:31.2877121Z moe/activation_test.py:117: 2025-05-07T20:33:31.2877259Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2877361Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2877509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2878042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2878141Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2878525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2878754Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2879116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2879212Z kernel = self.compile( 2025-05-07T20:33:31.2879619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2879804Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2879938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2879945Z 2025-05-07T20:33:31.2880160Z self = 2025-05-07T20:33:31.2880973Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2881493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e44dc4a0>} 2025-05-07T20:33:31.2882293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2882490Z context = 2025-05-07T20:33:31.2882498Z 2025-05-07T20:33:31.2882674Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2882949Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2883057Z module_map=module_map) 2025-05-07T20:33:31.2883228Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2883331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2883453Z E ^ 2025-05-07T20:33:31.2883826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2883831Z 2025-05-07T20:33:31.2884303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2884308Z 2025-05-07T20:33:31.2884435Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2884698Z self=, 2025-05-07T20:33:31.2884783Z T=2048, 2025-05-07T20:33:31.2884900Z D=7168, 2025-05-07T20:33:31.2884986Z scale_ub=1200.0, 2025-05-07T20:33:31.2885076Z contiguous=True, 2025-05-07T20:33:31.2885161Z compiled=False, 2025-05-07T20:33:31.2885238Z ) 2025-05-07T20:33:31.2885466Z self = 2025-05-07T20:33:31.2885643Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2885650Z 2025-05-07T20:33:31.2885728Z @given( 2025-05-07T20:33:31.2885849Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2885949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2886068Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2886192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2886308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2886384Z ) 2025-05-07T20:33:31.2886681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2886783Z def test_silu_mul_quant( 2025-05-07T20:33:31.2886866Z self, 2025-05-07T20:33:31.2886942Z T: int, 2025-05-07T20:33:31.2887021Z D: int, 2025-05-07T20:33:31.2887130Z scale_ub: Optional[float], 2025-05-07T20:33:31.2887220Z contiguous: bool, 2025-05-07T20:33:31.2887307Z compiled: bool, 2025-05-07T20:33:31.2887396Z ) -> None: 2025-05-07T20:33:31.2887494Z torch.manual_seed(2025) 2025-05-07T20:33:31.2887573Z 2025-05-07T20:33:31.2887747Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2889675Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
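Each "Trying example: test_silu_mul_quant(...)" block above is Hypothesis verbose mode re-running the same test body with a fresh draw from the sampled_from strategies. When debugging locally, a single failing draw can be replayed deterministically by pinning it with hypothesis.example before the random search starts. A sketch mirroring the decorators above (strategy lists shortened for brevity; the body is elided):

from hypothesis import Verbosity, example, given, settings
from hypothesis import strategies as st

@given(
    T=st.sampled_from([1, 128, 2048]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=128, D=7168)  # pin the first failing draw reported above
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant_repro(T: int, D: int) -> None:
    ...  # same body as test_silu_mul_quant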
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2889686Z 2025-05-07T20:33:31.2889805Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2889810Z 2025-05-07T20:33:31.2889914Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2890151Z self=, 2025-05-07T20:33:31.2890232Z T=1, 2025-05-07T20:33:31.2890314Z D=5120, 2025-05-07T20:33:31.2890402Z scale_ub=1200.0, 2025-05-07T20:33:31.2890490Z contiguous=True, 2025-05-07T20:33:31.2890578Z compiled=False, 2025-05-07T20:33:31.2890659Z ) 2025-05-07T20:33:31.2890886Z self = 2025-05-07T20:33:31.2891060Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2891064Z 2025-05-07T20:33:31.2891145Z @given( 2025-05-07T20:33:31.2891263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2891368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2891483Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2891599Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2891715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2891838Z ) 2025-05-07T20:33:31.2892095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2892195Z def test_silu_mul_quant( 2025-05-07T20:33:31.2892273Z self, 2025-05-07T20:33:31.2892354Z T: int, 2025-05-07T20:33:31.2892432Z D: int, 2025-05-07T20:33:31.2892569Z scale_ub: Optional[float], 2025-05-07T20:33:31.2892666Z contiguous: bool, 2025-05-07T20:33:31.2892752Z compiled: bool, 2025-05-07T20:33:31.2892835Z ) -> None: 2025-05-07T20:33:31.2892935Z torch.manual_seed(2025) 2025-05-07T20:33:31.2893050Z 2025-05-07T20:33:31.2893224Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2893305Z 2025-05-07T20:33:31.2893401Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2893529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2893623Z x = x_sign * x_clamp 2025-05-07T20:33:31.2893704Z x0 = x[:, :D] 2025-05-07T20:33:31.2893792Z x1 = x[:, D:] 2025-05-07T20:33:31.2893866Z 2025-05-07T20:33:31.2893950Z if contiguous: 2025-05-07T20:33:31.2894046Z x0 = x0.contiguous() 2025-05-07T20:33:31.2894135Z x1 = x1.contiguous() 2025-05-07T20:33:31.2894211Z 2025-05-07T20:33:31.2894308Z if scale_ub is not None: 2025-05-07T20:33:31.2894414Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2894653Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2894780Z ) 2025-05-07T20:33:31.2894860Z else: 2025-05-07T20:33:31.2894958Z scale_ub_tensor = None 2025-05-07T20:33:31.2895036Z 2025-05-07T20:33:31.2895170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2895265Z op = silu_mul_quant 2025-05-07T20:33:31.2895356Z if compiled: 2025-05-07T20:33:31.2895459Z op = torch.compile(op) 2025-05-07T20:33:31.2895570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2895647Z 2025-05-07T20:33:31.2895741Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2895746Z 2025-05-07T20:33:31.2895849Z moe/activation_test.py:117: 2025-05-07T20:33:31.2895982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2896085Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2896192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2896722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2896832Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2897211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2897445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2897808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2897907Z kernel = self.compile( 2025-05-07T20:33:31.2898310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2898496Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2898629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2898634Z 2025-05-07T20:33:31.2898849Z self = 2025-05-07T20:33:31.2899668Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2900187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e44dda80>} 2025-05-07T20:33:31.2901032Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2901265Z context = 2025-05-07T20:33:31.2901270Z 2025-05-07T20:33:31.2901448Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2901724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2901877Z module_map=module_map) 2025-05-07T20:33:31.2902040Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2902143Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2902223Z E ^ 2025-05-07T20:33:31.2902593Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2902601Z 2025-05-07T20:33:31.2903037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2903042Z 2025-05-07T20:33:31.2903148Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2903381Z self=, 2025-05-07T20:33:31.2903460Z T=2048, 2025-05-07T20:33:31.2903536Z D=5120, 2025-05-07T20:33:31.2903685Z scale_ub=None, 2025-05-07T20:33:31.2903777Z contiguous=True, 2025-05-07T20:33:31.2903864Z compiled=False, 2025-05-07T20:33:31.2903939Z ) 2025-05-07T20:33:31.2904170Z self = 2025-05-07T20:33:31.2904346Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2904350Z 2025-05-07T20:33:31.2904427Z @given( 2025-05-07T20:33:31.2908658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2908788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2908912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2909039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2909156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2909244Z ) 2025-05-07T20:33:31.2909503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2909606Z def test_silu_mul_quant( 2025-05-07T20:33:31.2909694Z self, 2025-05-07T20:33:31.2909781Z T: int, 2025-05-07T20:33:31.2909861Z D: int, 2025-05-07T20:33:31.2909968Z scale_ub: Optional[float], 2025-05-07T20:33:31.2910066Z contiguous: bool, 2025-05-07T20:33:31.2910154Z compiled: bool, 2025-05-07T20:33:31.2910243Z ) -> None: 2025-05-07T20:33:31.2910340Z torch.manual_seed(2025) 2025-05-07T20:33:31.2910412Z 2025-05-07T20:33:31.2910593Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2910669Z 2025-05-07T20:33:31.2910765Z > x_sign = torch.sign(x) 2025-05-07T20:33:31.2912693Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
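Note that the CompilationError fires on compiled=False draws as well: the traceback goes straight from the eager silu_mul_quant into the Triton launch _fbgemm_silu_mul_quant[grid] at activation.py:80, so torch.compile is not what pulls Triton in. If an up-front capability check is not wanted, the failure can instead be translated into a skip at launch time; a hypothetical wrapper, assuming a pytest runner (the exception's import path matches the traceback above, but the wrapper itself is not FBGEMM code):

import pytest
from triton.compiler.errors import CompilationError

def launch_or_skip(launch):
    # Run a zero-argument launch thunk; turn the unsupported-fp8 compile
    # failure seen above into a skip instead of a deep traceback.
    try:
        return launch()
    except CompilationError as err:
        if "fp8e4nv not supported" in str(err):
            pytest.skip("Triton: fp8e4nv unsupported on this GPU")
        raise

Inside fn() this would be called as launch_or_skip(lambda: op(x0, x1, scale_ub_tensor)).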
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2912701Z 2025-05-07T20:33:31.2912822Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:31.2912826Z 2025-05-07T20:33:31.2912929Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2913161Z self=, 2025-05-07T20:33:31.2913314Z T=16384, 2025-05-07T20:33:31.2913396Z D=5120, 2025-05-07T20:33:31.2913484Z scale_ub=None, 2025-05-07T20:33:31.2913580Z contiguous=True, 2025-05-07T20:33:31.2913669Z compiled=False, 2025-05-07T20:33:31.2913748Z ) 2025-05-07T20:33:31.2914022Z self = 2025-05-07T20:33:31.2914207Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2914214Z 2025-05-07T20:33:31.2914302Z @given( 2025-05-07T20:33:31.2914465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2914569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2914692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2914810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2914926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2915010Z ) 2025-05-07T20:33:31.2915272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2915372Z def test_silu_mul_quant( 2025-05-07T20:33:31.2915455Z self, 2025-05-07T20:33:31.2915539Z T: int, 2025-05-07T20:33:31.2915619Z D: int, 2025-05-07T20:33:31.2915727Z scale_ub: Optional[float], 2025-05-07T20:33:31.2915820Z contiguous: bool, 2025-05-07T20:33:31.2915916Z compiled: bool, 2025-05-07T20:33:31.2915998Z ) -> None: 2025-05-07T20:33:31.2916137Z torch.manual_seed(2025) 2025-05-07T20:33:31.2916218Z 2025-05-07T20:33:31.2916391Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2918301Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2918313Z 2025-05-07T20:33:31.2918434Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2918438Z 2025-05-07T20:33:31.2918540Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2918773Z self=, 2025-05-07T20:33:31.2918855Z T=4096, 2025-05-07T20:33:31.2918933Z D=5120, 2025-05-07T20:33:31.2919024Z scale_ub=None, 2025-05-07T20:33:31.2919108Z contiguous=True, 2025-05-07T20:33:31.2919196Z compiled=False, 2025-05-07T20:33:31.2919271Z ) 2025-05-07T20:33:31.2919491Z self = 2025-05-07T20:33:31.2919671Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2919678Z 2025-05-07T20:33:31.2919757Z @given( 2025-05-07T20:33:31.2919875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2919976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2920095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2920210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2920332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2920409Z ) 2025-05-07T20:33:31.2920667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2920763Z def test_silu_mul_quant( 2025-05-07T20:33:31.2920840Z self, 2025-05-07T20:33:31.2920923Z T: int, 2025-05-07T20:33:31.2920999Z D: int, 2025-05-07T20:33:31.2921098Z scale_ub: Optional[float], 2025-05-07T20:33:31.2921191Z contiguous: bool, 2025-05-07T20:33:31.2921278Z compiled: bool, 2025-05-07T20:33:31.2921408Z ) -> None: 2025-05-07T20:33:31.2921511Z torch.manual_seed(2025) 2025-05-07T20:33:31.2921584Z 2025-05-07T20:33:31.2921754Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2923711Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2923754Z 2025-05-07T20:33:31.2923878Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2923882Z 2025-05-07T20:33:31.2923989Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2924225Z self=, 2025-05-07T20:33:31.2924310Z T=2048, 2025-05-07T20:33:31.2924388Z D=5120, 2025-05-07T20:33:31.2924469Z scale_ub=None, 2025-05-07T20:33:31.2924562Z contiguous=False, 2025-05-07T20:33:31.2924653Z compiled=False, 2025-05-07T20:33:31.2924732Z ) 2025-05-07T20:33:31.2924956Z self = 2025-05-07T20:33:31.2925173Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2925180Z 2025-05-07T20:33:31.2925263Z @given( 2025-05-07T20:33:31.2925382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2925727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2925886Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2926002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2926119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2926203Z ) 2025-05-07T20:33:31.2926457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2926556Z def test_silu_mul_quant( 2025-05-07T20:33:31.2926635Z self, 2025-05-07T20:33:31.2926712Z T: int, 2025-05-07T20:33:31.2926790Z D: int, 2025-05-07T20:33:31.2926889Z scale_ub: Optional[float], 2025-05-07T20:33:31.2926974Z contiguous: bool, 2025-05-07T20:33:31.2927069Z compiled: bool, 2025-05-07T20:33:31.2927146Z ) -> None: 2025-05-07T20:33:31.2927244Z torch.manual_seed(2025) 2025-05-07T20:33:31.2927325Z 2025-05-07T20:33:31.2927495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2929403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2929415Z 2025-05-07T20:33:31.2929529Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2929534Z 2025-05-07T20:33:31.2929635Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2929866Z self=, 2025-05-07T20:33:31.2929945Z T=4096, 2025-05-07T20:33:31.2930027Z D=7168, 2025-05-07T20:33:31.2930117Z scale_ub=None, 2025-05-07T20:33:31.2930205Z contiguous=True, 2025-05-07T20:33:31.2930296Z compiled=True, 2025-05-07T20:33:31.2930371Z ) 2025-05-07T20:33:31.2930594Z self = 2025-05-07T20:33:31.2930856Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.2930860Z 2025-05-07T20:33:31.2930936Z @given( 2025-05-07T20:33:31.2931048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2931148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2931317Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2931429Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2931544Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2931675Z ) 2025-05-07T20:33:31.2931929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2932021Z def test_silu_mul_quant( 2025-05-07T20:33:31.2932100Z self, 2025-05-07T20:33:31.2932178Z T: int, 2025-05-07T20:33:31.2932252Z D: int, 2025-05-07T20:33:31.2932349Z scale_ub: Optional[float], 2025-05-07T20:33:31.2932441Z contiguous: bool, 2025-05-07T20:33:31.2932529Z compiled: bool, 2025-05-07T20:33:31.2932604Z ) -> None: 2025-05-07T20:33:31.2932697Z torch.manual_seed(2025) 2025-05-07T20:33:31.2932769Z 2025-05-07T20:33:31.2932936Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2935021Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2935032Z 2025-05-07T20:33:31.2935174Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2935181Z 2025-05-07T20:33:31.2935280Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2935505Z self=, 2025-05-07T20:33:31.2935586Z T=2048, 2025-05-07T20:33:31.2935665Z D=5120, 2025-05-07T20:33:31.2935746Z scale_ub=1200.0, 2025-05-07T20:33:31.2935836Z contiguous=False, 2025-05-07T20:33:31.2935921Z compiled=False, 2025-05-07T20:33:31.2935992Z ) 2025-05-07T20:33:31.2936215Z self = 2025-05-07T20:33:31.2936392Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2936399Z 2025-05-07T20:33:31.2936479Z @given( 2025-05-07T20:33:31.2936592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2936689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2936810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2936926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2937039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2937117Z ) 2025-05-07T20:33:31.2937367Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2937459Z def test_silu_mul_quant( 2025-05-07T20:33:31.2937542Z self, 2025-05-07T20:33:31.2937619Z T: int, 2025-05-07T20:33:31.2937693Z D: int, 2025-05-07T20:33:31.2937794Z scale_ub: Optional[float], 2025-05-07T20:33:31.2937884Z contiguous: bool, 2025-05-07T20:33:31.2937967Z compiled: bool, 2025-05-07T20:33:31.2938045Z ) -> None: 2025-05-07T20:33:31.2938139Z torch.manual_seed(2025) 2025-05-07T20:33:31.2938214Z 2025-05-07T20:33:31.2938383Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2940327Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2940374Z 2025-05-07T20:33:31.2940491Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2940496Z 2025-05-07T20:33:31.2940659Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2940889Z self=, 2025-05-07T20:33:31.2940965Z T=4096, 2025-05-07T20:33:31.2941043Z D=7168, 2025-05-07T20:33:31.2941127Z scale_ub=1200.0, 2025-05-07T20:33:31.2941210Z contiguous=True, 2025-05-07T20:33:31.2941297Z compiled=False, 2025-05-07T20:33:31.2941371Z ) 2025-05-07T20:33:31.2941591Z self = 2025-05-07T20:33:31.2941770Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2941775Z 2025-05-07T20:33:31.2941850Z @given( 2025-05-07T20:33:31.2941966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2942067Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2942179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2942332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2942455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2942528Z ) 2025-05-07T20:33:31.2942783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2942874Z def test_silu_mul_quant( 2025-05-07T20:33:31.2942949Z self, 2025-05-07T20:33:31.2943028Z T: int, 2025-05-07T20:33:31.2943104Z D: int, 2025-05-07T20:33:31.2943205Z scale_ub: Optional[float], 2025-05-07T20:33:31.2943296Z contiguous: bool, 2025-05-07T20:33:31.2943378Z compiled: bool, 2025-05-07T20:33:31.2943453Z ) -> None: 2025-05-07T20:33:31.2943549Z torch.manual_seed(2025) 2025-05-07T20:33:31.2943619Z 2025-05-07T20:33:31.2943790Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2945704Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2945714Z 2025-05-07T20:33:31.2945834Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2945838Z 2025-05-07T20:33:31.2945936Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2946163Z self=, 2025-05-07T20:33:31.2946247Z T=16384, 2025-05-07T20:33:31.2946326Z D=7168, 2025-05-07T20:33:31.2946408Z scale_ub=None, 2025-05-07T20:33:31.2946497Z contiguous=False, 2025-05-07T20:33:31.2946584Z compiled=True, 2025-05-07T20:33:31.2946660Z ) 2025-05-07T20:33:31.2946884Z self = 2025-05-07T20:33:31.2947066Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2947071Z 2025-05-07T20:33:31.2947151Z @given( 2025-05-07T20:33:31.2947267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2947365Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2947480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2947641Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2947752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2947829Z ) 2025-05-07T20:33:31.2948076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2948208Z def test_silu_mul_quant( 2025-05-07T20:33:31.2948293Z self, 2025-05-07T20:33:31.2948376Z T: int, 2025-05-07T20:33:31.2948458Z D: int, 2025-05-07T20:33:31.2948566Z scale_ub: Optional[float], 2025-05-07T20:33:31.2948695Z contiguous: bool, 2025-05-07T20:33:31.2948786Z compiled: bool, 2025-05-07T20:33:31.2948861Z ) -> None: 2025-05-07T20:33:31.2948954Z torch.manual_seed(2025) 2025-05-07T20:33:31.2949032Z 2025-05-07T20:33:31.2949201Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2951108Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2951157Z 2025-05-07T20:33:31.2951272Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2951279Z 2025-05-07T20:33:31.2951378Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2951608Z self=, 2025-05-07T20:33:31.2951685Z T=4096, 2025-05-07T20:33:31.2951763Z D=7168, 2025-05-07T20:33:31.2951847Z scale_ub=None, 2025-05-07T20:33:31.2951931Z contiguous=True, 2025-05-07T20:33:31.2952019Z compiled=False, 2025-05-07T20:33:31.2952094Z ) 2025-05-07T20:33:31.2952313Z self = 2025-05-07T20:33:31.2952486Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2952491Z 2025-05-07T20:33:31.2952569Z @given( 2025-05-07T20:33:31.2952684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2952782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2952897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2953011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2953124Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2953196Z ) 2025-05-07T20:33:31.2953450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2953541Z def test_silu_mul_quant( 2025-05-07T20:33:31.2953618Z self, 2025-05-07T20:33:31.2953701Z T: int, 2025-05-07T20:33:31.2953779Z D: int, 2025-05-07T20:33:31.2953876Z scale_ub: Optional[float], 2025-05-07T20:33:31.2953966Z contiguous: bool, 2025-05-07T20:33:31.2954049Z compiled: bool, 2025-05-07T20:33:31.2954122Z ) -> None: 2025-05-07T20:33:31.2954221Z torch.manual_seed(2025) 2025-05-07T20:33:31.2954296Z 2025-05-07T20:33:31.2954465Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2956380Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2956432Z 2025-05-07T20:33:31.2956552Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2956556Z 2025-05-07T20:33:31.2956657Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2956921Z self=, 2025-05-07T20:33:31.2957004Z T=16384, 2025-05-07T20:33:31.2957078Z D=7168, 2025-05-07T20:33:31.2957159Z scale_ub=None, 2025-05-07T20:33:31.2957249Z contiguous=True, 2025-05-07T20:33:31.2957331Z compiled=False, 2025-05-07T20:33:31.2957448Z ) 2025-05-07T20:33:31.2957673Z self = 2025-05-07T20:33:31.2957854Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2957858Z 2025-05-07T20:33:31.2957941Z @given( 2025-05-07T20:33:31.2958060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2958161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2958282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2958398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2958511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2958593Z ) 2025-05-07T20:33:31.2958847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2958943Z def test_silu_mul_quant( 2025-05-07T20:33:31.2959027Z self, 2025-05-07T20:33:31.2959148Z T: int, 2025-05-07T20:33:31.2959228Z D: int, 2025-05-07T20:33:31.2959331Z scale_ub: Optional[float], 2025-05-07T20:33:31.2959421Z contiguous: bool, 2025-05-07T20:33:31.2959506Z compiled: bool, 2025-05-07T20:33:31.2959581Z ) -> None: 2025-05-07T20:33:31.2959679Z torch.manual_seed(2025) 2025-05-07T20:33:31.2959755Z 2025-05-07T20:33:31.2959925Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2961840Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2961851Z 2025-05-07T20:33:31.2961965Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2961969Z 2025-05-07T20:33:31.2962070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2962299Z self=, 2025-05-07T20:33:31.2962376Z T=16384, 2025-05-07T20:33:31.2962453Z D=7168, 2025-05-07T20:33:31.2962538Z scale_ub=1200.0, 2025-05-07T20:33:31.2962621Z contiguous=True, 2025-05-07T20:33:31.2962711Z compiled=False, 2025-05-07T20:33:31.2962786Z ) 2025-05-07T20:33:31.2963005Z self = 2025-05-07T20:33:31.2963187Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2963192Z 2025-05-07T20:33:31.2963267Z @given( 2025-05-07T20:33:31.2963380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2963480Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2963595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2963708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2963823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2963896Z ) 2025-05-07T20:33:31.2964150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2964241Z def test_silu_mul_quant( 2025-05-07T20:33:31.2964363Z self, 2025-05-07T20:33:31.2964448Z T: int, 2025-05-07T20:33:31.2964527Z D: int, 2025-05-07T20:33:31.2964625Z scale_ub: Optional[float], 2025-05-07T20:33:31.2964742Z contiguous: bool, 2025-05-07T20:33:31.2964833Z compiled: bool, 2025-05-07T20:33:31.2965039Z ) -> None: 2025-05-07T20:33:31.2965136Z torch.manual_seed(2025) 2025-05-07T20:33:31.2965209Z 2025-05-07T20:33:31.2965379Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2967289Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
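The allocator hint repeated in each of these errors is an environment variable, and it only takes effect if it is in place before the process's first CUDA allocation. A minimal sketch, assuming the test process sets it itself rather than the CI job environment:

import os

# Must be set before torch initializes the CUDA caching allocator, so set it
# before importing torch (or export it in the job environment instead).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the variable is set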
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2967339Z 2025-05-07T20:33:31.2967457Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2967461Z 2025-05-07T20:33:31.2967564Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2967795Z self=, 2025-05-07T20:33:31.2967876Z T=128, 2025-05-07T20:33:31.2967956Z D=5120, 2025-05-07T20:33:31.2968080Z scale_ub=1200.0, 2025-05-07T20:33:31.2968177Z contiguous=False, 2025-05-07T20:33:31.2968263Z compiled=False, 2025-05-07T20:33:31.2968340Z ) 2025-05-07T20:33:31.2968566Z self = 2025-05-07T20:33:31.2968742Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2968746Z 2025-05-07T20:33:31.2968829Z @given( 2025-05-07T20:33:31.2968945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2969047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2969166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2969284Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2969403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2969485Z ) 2025-05-07T20:33:31.2969738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2969838Z def test_silu_mul_quant( 2025-05-07T20:33:31.2969923Z self, 2025-05-07T20:33:31.2970008Z T: int, 2025-05-07T20:33:31.2970084Z D: int, 2025-05-07T20:33:31.2970188Z scale_ub: Optional[float], 2025-05-07T20:33:31.2970280Z contiguous: bool, 2025-05-07T20:33:31.2970369Z compiled: bool, 2025-05-07T20:33:31.2970447Z ) -> None: 2025-05-07T20:33:31.2970542Z torch.manual_seed(2025) 2025-05-07T20:33:31.2970618Z 2025-05-07T20:33:31.2970787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2970859Z 2025-05-07T20:33:31.2970950Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2971079Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2971167Z x = x_sign * x_clamp 2025-05-07T20:33:31.2971249Z x0 = x[:, :D] 2025-05-07T20:33:31.2971329Z x1 = x[:, D:] 2025-05-07T20:33:31.2971404Z 2025-05-07T20:33:31.2971490Z if contiguous: 2025-05-07T20:33:31.2971582Z x0 = x0.contiguous() 2025-05-07T20:33:31.2971676Z x1 = x1.contiguous() 2025-05-07T20:33:31.2971752Z 2025-05-07T20:33:31.2971842Z if scale_ub is not None: 2025-05-07T20:33:31.2971947Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2972080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2972155Z ) 2025-05-07T20:33:31.2972233Z else: 2025-05-07T20:33:31.2972329Z scale_ub_tensor = None 2025-05-07T20:33:31.2972450Z 2025-05-07T20:33:31.2972580Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2972669Z op = silu_mul_quant 2025-05-07T20:33:31.2972753Z if compiled: 2025-05-07T20:33:31.2972849Z op = torch.compile(op) 2025-05-07T20:33:31.2973015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2973087Z 2025-05-07T20:33:31.2973176Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2973180Z 2025-05-07T20:33:31.2973275Z moe/activation_test.py:117: 2025-05-07T20:33:31.2973450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2973549Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2973647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2974175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2974273Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2974711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2974939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2975301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2975396Z kernel = self.compile( 2025-05-07T20:33:31.2975849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2976035Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2976165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2976169Z 2025-05-07T20:33:31.2976376Z self = 2025-05-07T20:33:31.2977189Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2977711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0527f147c0>} 2025-05-07T20:33:31.2978510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2978707Z context = 2025-05-07T20:33:31.2978712Z 2025-05-07T20:33:31.2978880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2980415Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2980523Z module_map=module_map) 2025-05-07T20:33:31.2980687Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2980784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2980863Z E ^ 2025-05-07T20:33:31.2981240Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2981245Z 2025-05-07T20:33:31.2981680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2981686Z 2025-05-07T20:33:31.2981791Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2982018Z self=, 2025-05-07T20:33:31.2982095Z T=2048, 2025-05-07T20:33:31.2982170Z D=7168, 2025-05-07T20:33:31.2982247Z scale_ub=None, 2025-05-07T20:33:31.2982335Z contiguous=False, 2025-05-07T20:33:31.2982421Z compiled=False, 2025-05-07T20:33:31.2982545Z ) 2025-05-07T20:33:31.2982770Z self = 2025-05-07T20:33:31.2982952Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2982956Z 2025-05-07T20:33:31.2983038Z @given( 2025-05-07T20:33:31.2983202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2983302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2983417Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2983536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2983688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2983763Z ) 2025-05-07T20:33:31.2984017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2984110Z def test_silu_mul_quant( 2025-05-07T20:33:31.2984185Z self, 2025-05-07T20:33:31.2984265Z T: int, 2025-05-07T20:33:31.2984345Z D: int, 2025-05-07T20:33:31.2984445Z scale_ub: Optional[float], 2025-05-07T20:33:31.2984532Z contiguous: bool, 2025-05-07T20:33:31.2984616Z compiled: bool, 2025-05-07T20:33:31.2984695Z ) -> None: 2025-05-07T20:33:31.2984790Z torch.manual_seed(2025) 2025-05-07T20:33:31.2984862Z 2025-05-07T20:33:31.2985040Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2986991Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
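The CompilationError above is a different failure from the OOMs: Triton rejects the fp8e4nv (e4m3) dtype on this GPU, which appears to be a compute-capability limit. fp8e4nv kernels need SM 8.9 (Ada) or newer, while the A10G on this g5.4xlarge runner reports SM 8.6, leaving only the fp8e4b15/fp8e5 variants named in the message. A hedged sketch of the kind of guard a test could apply (the helper name is illustrative, not FBGEMM API):

import torch

def cuda_supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) requires compute capability >= (8, 9); the A10G in
    # this job reports (8, 6) and raises the ValueError shown above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)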
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2987002Z 2025-05-07T20:33:31.2987125Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2987129Z 2025-05-07T20:33:31.2987230Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2987457Z self=, 2025-05-07T20:33:31.2987542Z T=128, 2025-05-07T20:33:31.2987617Z D=7168, 2025-05-07T20:33:31.2987702Z scale_ub=1200.0, 2025-05-07T20:33:31.2987787Z contiguous=True, 2025-05-07T20:33:31.2987872Z compiled=True, 2025-05-07T20:33:31.2987949Z ) 2025-05-07T20:33:31.2988171Z self = 2025-05-07T20:33:31.2988341Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.2988346Z 2025-05-07T20:33:31.2988424Z @given( 2025-05-07T20:33:31.2988539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2988636Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2988753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2988866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2988979Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2989049Z ) 2025-05-07T20:33:31.2989301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2989397Z def test_silu_mul_quant( 2025-05-07T20:33:31.2989471Z self, 2025-05-07T20:33:31.2989544Z T: int, 2025-05-07T20:33:31.2989626Z D: int, 2025-05-07T20:33:31.2989722Z scale_ub: Optional[float], 2025-05-07T20:33:31.2989809Z contiguous: bool, 2025-05-07T20:33:31.2989897Z compiled: bool, 2025-05-07T20:33:31.2989972Z ) -> None: 2025-05-07T20:33:31.2990063Z torch.manual_seed(2025) 2025-05-07T20:33:31.2990138Z 2025-05-07T20:33:31.2990307Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2990379Z 2025-05-07T20:33:31.2990522Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2990646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2990735Z x = x_sign * x_clamp 2025-05-07T20:33:31.2990813Z x0 = x[:, :D] 2025-05-07T20:33:31.2990892Z x1 = x[:, D:] 2025-05-07T20:33:31.2990968Z 2025-05-07T20:33:31.2991090Z if contiguous: 2025-05-07T20:33:31.2991181Z x0 = x0.contiguous() 2025-05-07T20:33:31.2991275Z x1 = x1.contiguous() 2025-05-07T20:33:31.2991348Z 2025-05-07T20:33:31.2991438Z if scale_ub is not None: 2025-05-07T20:33:31.2991586Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2991721Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2991799Z ) 2025-05-07T20:33:31.2991880Z else: 2025-05-07T20:33:31.2991976Z scale_ub_tensor = None 2025-05-07T20:33:31.2992052Z 2025-05-07T20:33:31.2992179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2992269Z op = silu_mul_quant 2025-05-07T20:33:31.2992355Z if compiled: 2025-05-07T20:33:31.2992454Z op = torch.compile(op) 2025-05-07T20:33:31.2992558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2992635Z 2025-05-07T20:33:31.2992726Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2992731Z 2025-05-07T20:33:31.2992824Z moe/activation_test.py:117: 2025-05-07T20:33:31.2993002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2993102Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2993208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2993597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2993687Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2994214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2994313Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2994689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2994927Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2995284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2995382Z kernel = self.compile( 2025-05-07T20:33:31.2995783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2995965Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2996102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2996107Z 2025-05-07T20:33:31.2996314Z self = 2025-05-07T20:33:31.2997130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2997647Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0527f15940>} 2025-05-07T20:33:31.2998439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2998640Z context = 2025-05-07T20:33:31.2998645Z 2025-05-07T20:33:31.2998811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2999086Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2999238Z module_map=module_map) 2025-05-07T20:33:31.2999398Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2999500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2999576Z E ^ 2025-05-07T20:33:31.2999986Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2999995Z 2025-05-07T20:33:31.3000434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.3000475Z 2025-05-07T20:33:31.3000578Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.3000809Z self=, 2025-05-07T20:33:31.3000889Z T=128, 2025-05-07T20:33:31.3000961Z D=7168, 2025-05-07T20:33:31.3001047Z scale_ub=1200.0, 2025-05-07T20:33:31.3001134Z contiguous=True, 2025-05-07T20:33:31.3001213Z compiled=False, 2025-05-07T20:33:31.3001293Z ) 2025-05-07T20:33:31.3001515Z self = 2025-05-07T20:33:31.3001688Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.3001693Z 2025-05-07T20:33:31.3001773Z @given( 2025-05-07T20:33:31.3001887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.3002028Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.3002143Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.3002261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.3002374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.3002447Z ) 2025-05-07T20:33:31.3002699Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.3002789Z def test_silu_mul_quant( 2025-05-07T20:33:31.3002862Z self, 2025-05-07T20:33:31.3002944Z T: int, 2025-05-07T20:33:31.3003019Z D: int, 2025-05-07T20:33:31.3003115Z scale_ub: Optional[float], 2025-05-07T20:33:31.3003204Z contiguous: bool, 2025-05-07T20:33:31.3003285Z compiled: bool, 2025-05-07T20:33:31.3003360Z ) -> None: 2025-05-07T20:33:31.3003458Z torch.manual_seed(2025) 2025-05-07T20:33:31.3003533Z 2025-05-07T20:33:31.3003704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.3003785Z 2025-05-07T20:33:31.3003876Z x_sign = torch.sign(x) 2025-05-07T20:33:31.3004000Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.3005911Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.3005919Z 2025-05-07T20:33:31.3006040Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.3006044Z 2025-05-07T20:33:31.3006145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.3006377Z self=, 2025-05-07T20:33:31.3006456Z T=128, 2025-05-07T20:33:31.3006540Z D=5120, 2025-05-07T20:33:31.3006621Z scale_ub=1200.0, 2025-05-07T20:33:31.3006709Z contiguous=True, 2025-05-07T20:33:31.3006792Z compiled=True, 2025-05-07T20:33:31.3006869Z ) 2025-05-07T20:33:31.3007096Z self = 2025-05-07T20:33:31.3007263Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.3007336Z 2025-05-07T20:33:31.3007422Z @given( 2025-05-07T20:33:31.3007540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.3007640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.3007759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.3007915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.3008029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.3008105Z ) 2025-05-07T20:33:31.3008358Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.3008490Z def test_silu_mul_quant( 2025-05-07T20:33:31.3008574Z self, 2025-05-07T20:33:31.3008652Z T: int, 2025-05-07T20:33:31.3008736Z D: int, 2025-05-07T20:33:31.3008832Z scale_ub: Optional[float], 2025-05-07T20:33:31.3008917Z contiguous: bool, 2025-05-07T20:33:31.3009003Z compiled: bool, 2025-05-07T20:33:31.3009079Z ) -> None: 2025-05-07T20:33:31.3009176Z torch.manual_seed(2025) 2025-05-07T20:33:31.3009253Z 2025-05-07T20:33:31.3009418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.3009491Z 2025-05-07T20:33:31.3009585Z x_sign = torch.sign(x) 2025-05-07T20:33:31.3009708Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.3011652Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
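Note how the failure point has moved from the torch.randn at activation_test.py:92 to the torch.clamp at line 95, and free memory has shrunk from 26.44 MiB to 4.44 MiB across examples: allocations are accumulating between Hypothesis examples rather than any single example being too large. One common mitigation, sketched here only as an assumption about what might help, is to collect dead tensors and release cached blocks between examples:

import gc
import torch

def free_between_examples() -> None:
    # Hypothetical cleanup between Hypothesis examples: reclaim dead tensors,
    # then hand the allocator's cached blocks back to the driver so a small
    # later case (e.g. T=128) is not starved by earlier large ones.
    gc.collect()
    torch.cuda.empty_cache()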
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.3011662Z 2025-05-07T20:33:31.3011779Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.3011784Z 2025-05-07T20:33:31.3011883Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.3012112Z self=, 2025-05-07T20:33:31.3012189Z T=128, 2025-05-07T20:33:31.3012271Z D=7168, 2025-05-07T20:33:31.3012352Z scale_ub=None, 2025-05-07T20:33:31.3012434Z contiguous=True, 2025-05-07T20:33:31.3012518Z compiled=True, 2025-05-07T20:33:31.3012595Z ) 2025-05-07T20:33:31.3012816Z self = 2025-05-07T20:33:31.3012987Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.3012992Z 2025-05-07T20:33:31.3013073Z @given( 2025-05-07T20:33:31.3013191Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.3013295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.3013409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.3013532Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.3013645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.3013719Z ) 2025-05-07T20:33:31.3013972Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.3014072Z def test_silu_mul_quant( 2025-05-07T20:33:31.3014152Z self, 2025-05-07T20:33:31.3014235Z T: int, 2025-05-07T20:33:31.3014313Z D: int, 2025-05-07T20:33:31.3014417Z scale_ub: Optional[float], 2025-05-07T20:33:31.3014573Z contiguous: bool, 2025-05-07T20:33:31.3014664Z compiled: bool, 2025-05-07T20:33:31.3014739Z ) -> None: 2025-05-07T20:33:31.3014838Z torch.manual_seed(2025) 2025-05-07T20:33:31.3014928Z 2025-05-07T20:33:31.3015123Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.3017066Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.3017111Z 2025-05-07T20:33:31.3017231Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.3017404Z =============================== warnings summary =============================== 2025-05-07T20:33:31.3017724Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:31.3018035Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:31.3018345Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:31.3019283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:31.3019515Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:31.3019560Z 2025-05-07T20:33:31.3019778Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:31.3019951Z ================= 1 failed, 1 deselected, 3 warnings in 13.45s ================= 2025-05-07T20:33:33.0122728Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:33.0797047Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:33.0797500Z 2025-05-07T20:33:35.0816550Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:37.2571166Z ============================= test session starts ============================== 2025-05-07T20:33:37.2572453Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:37.2573542Z cachedir: .pytest_cache 2025-05-07T20:33:37.2574876Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:37.2576166Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:37.2576608Z plugins: hypothesis-6.131.14 2025-05-07T20:33:38.8858267Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:38.9946899Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:38.9947486Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:38.9947799Z 2025-05-07T20:33:41.3916935Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.3918566Z self=, 2025-05-07T20:33:41.3919428Z T=1, 2025-05-07T20:33:41.3919805Z D=5120, 2025-05-07T20:33:41.3920194Z scale_ub=None, 2025-05-07T20:33:41.3920632Z contiguous=True, 2025-05-07T20:33:41.3921070Z compiled=True, 2025-05-07T20:33:41.3921469Z ) 2025-05-07T20:33:41.3922114Z self = 2025-05-07T20:33:41.3923102Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.3923637Z 2025-05-07T20:33:41.3923795Z @given( 2025-05-07T20:33:41.3924265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.3933588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.3933966Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.3934318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.3934960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.3935270Z ) 2025-05-07T20:33:41.3935642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.3936110Z def test_silu_mul_quant( 2025-05-07T20:33:41.3936369Z self, 2025-05-07T20:33:41.3936668Z T: int, 2025-05-07T20:33:41.3936859Z D: int, 2025-05-07T20:33:41.3937107Z scale_ub: Optional[float], 2025-05-07T20:33:41.3937416Z contiguous: bool, 2025-05-07T20:33:41.3937658Z compiled: bool, 2025-05-07T20:33:41.3937894Z ) -> None: 2025-05-07T20:33:41.3938121Z torch.manual_seed(2025) 2025-05-07T20:33:41.3938364Z 2025-05-07T20:33:41.3938650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.3939015Z 2025-05-07T20:33:41.3939207Z x_sign = torch.sign(x) 2025-05-07T20:33:41.3939508Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:41.3939827Z x = x_sign * x_clamp 2025-05-07T20:33:41.3940070Z x0 = x[:, :D] 2025-05-07T20:33:41.3940289Z x1 = x[:, D:] 2025-05-07T20:33:41.3940504Z 2025-05-07T20:33:41.3940689Z if contiguous: 2025-05-07T20:33:41.3941024Z x0 = x0.contiguous() 2025-05-07T20:33:41.3941300Z x1 = x1.contiguous() 2025-05-07T20:33:41.3941558Z 2025-05-07T20:33:41.3941755Z if scale_ub is not None: 2025-05-07T20:33:41.3942050Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.3942403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.3942719Z ) 2025-05-07T20:33:41.3942921Z else: 2025-05-07T20:33:41.3943141Z scale_ub_tensor = None 2025-05-07T20:33:41.3943400Z 2025-05-07T20:33:41.3943641Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.3943967Z op = silu_mul_quant 2025-05-07T20:33:41.3944213Z if compiled: 2025-05-07T20:33:41.3944464Z op = torch.compile(op) 2025-05-07T20:33:41.3944779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.3945060Z 2025-05-07T20:33:41.3945256Z y_fp8, y_scale = fn() 2025-05-07T20:33:41.3945548Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:41.3945846Z 2025-05-07T20:33:41.3946083Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.3946433Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:41.3946742Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:41.3947084Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:41.3947485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3947811Z 2025-05-07T20:33:41.3948011Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:41.3948222Z 2025-05-07T20:33:41.3948325Z moe/activation_test.py:126: 2025-05-07T20:33:41.3948635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.3948984Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:41.3949324Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3950161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:41.3950962Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:41.3951529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.3952248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.3952978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:41.3953880Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:41.3954643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:41.3955363Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:41.3956007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:41.3956547Z fn() 2025-05-07T20:33:41.3957125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:41.3957747Z self.fn.run( 2025-05-07T20:33:41.3958234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.3958785Z kernel = self.compile( 2025-05-07T20:33:41.3959353Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.3960043Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.3960451Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.3960700Z 2025-05-07T20:33:41.3960914Z self = 2025-05-07T20:33:41.3962094Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.3963561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0b065c60>} 2025-05-07T20:33:41.3964981Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.3966076Z context = 2025-05-07T20:33:41.3966389Z 2025-05-07T20:33:41.3966566Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.3967118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.3967612Z module_map=module_map) 2025-05-07T20:33:41.3967990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.3968366Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:41.3968652Z E ^ 2025-05-07T20:33:41.3969130Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.3969611Z 2025-05-07T20:33:41.3970051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.3970606Z 2025-05-07T20:33:41.3970715Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.3971148Z self=, 2025-05-07T20:33:41.3971565Z T=2048, 2025-05-07T20:33:41.3971766Z D=5120, 2025-05-07T20:33:41.3971970Z scale_ub=1200.0, 2025-05-07T20:33:41.3972197Z contiguous=True, 2025-05-07T20:33:41.3972434Z compiled=False, 2025-05-07T20:33:41.3972670Z ) 2025-05-07T20:33:42.1328763Z self = 2025-05-07T20:33:42.1330417Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:42.1331201Z 2025-05-07T20:33:42.1331422Z @given( 2025-05-07T20:33:42.1331999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.1332647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.1333257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.1334345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.1335174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.1335740Z ) 2025-05-07T20:33:42.1336578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.1337383Z def test_silu_mul_quant( 2025-05-07T20:33:42.1337678Z self, 2025-05-07T20:33:42.1337870Z T: int, 2025-05-07T20:33:42.1338084Z D: int, 2025-05-07T20:33:42.1338310Z scale_ub: Optional[float], 2025-05-07T20:33:42.1338673Z contiguous: bool, 2025-05-07T20:33:42.1338920Z compiled: bool, 2025-05-07T20:33:42.1339149Z ) -> None: 2025-05-07T20:33:42.1339366Z torch.manual_seed(2025) 2025-05-07T20:33:42.1339614Z 2025-05-07T20:33:42.1339894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.1340244Z 2025-05-07T20:33:42.1340446Z x_sign = torch.sign(x) 2025-05-07T20:33:42.1340753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.1341064Z x = x_sign * x_clamp 2025-05-07T20:33:42.1341317Z x0 = x[:, :D] 
2025-05-07T20:33:42.1341536Z x1 = x[:, D:] 2025-05-07T20:33:42.1341747Z 2025-05-07T20:33:42.1341950Z if contiguous: 2025-05-07T20:33:42.1342191Z x0 = x0.contiguous() 2025-05-07T20:33:42.1342450Z x1 = x1.contiguous() 2025-05-07T20:33:42.1342703Z 2025-05-07T20:33:42.1342981Z if scale_ub is not None: 2025-05-07T20:33:42.1343257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.1343598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.1343914Z ) 2025-05-07T20:33:42.1344112Z else: 2025-05-07T20:33:42.1344319Z scale_ub_tensor = None 2025-05-07T20:33:42.1344573Z 2025-05-07T20:33:42.1344810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1345128Z op = silu_mul_quant 2025-05-07T20:33:42.1345384Z if compiled: 2025-05-07T20:33:42.1345637Z op = torch.compile(op) 2025-05-07T20:33:42.1345937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1346221Z 2025-05-07T20:33:42.1346423Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.1346593Z 2025-05-07T20:33:42.1346695Z moe/activation_test.py:117: 2025-05-07T20:33:42.1347006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1347362Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.1347660Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1348381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.1349117Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.1349686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.1350407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.1351105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.1351670Z kernel = self.compile( 2025-05-07T20:33:42.1352249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.1352937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.1353353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1353595Z 2025-05-07T20:33:42.1353819Z self = 2025-05-07T20:33:42.1354947Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.1356450Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0aebc220>} 2025-05-07T20:33:42.1357974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.1359084Z context = 2025-05-07T20:33:42.1359428Z 2025-05-07T20:33:42.1359615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.1360165Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.1360665Z module_map=module_map) 2025-05-07T20:33:42.1361046Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.1361429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.1361695Z E ^ 2025-05-07T20:33:42.1362185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.1362661Z 2025-05-07T20:33:42.1363116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.1363664Z 2025-05-07T20:33:42.1363779Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.1364289Z self=, 2025-05-07T20:33:42.1364728Z T=2048, 2025-05-07T20:33:42.1364932Z D=5120, 2025-05-07T20:33:42.1365137Z scale_ub=1200.0, 2025-05-07T20:33:42.1365373Z contiguous=True, 2025-05-07T20:33:42.1365610Z compiled=True, 2025-05-07T20:33:42.1365822Z ) 2025-05-07T20:33:42.1366164Z self = 2025-05-07T20:33:42.1366688Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.1366976Z 2025-05-07T20:33:42.1367060Z @given( 2025-05-07T20:33:42.1367300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.1367629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.1367951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.1368291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.1368648Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.1368947Z ) 2025-05-07T20:33:42.1369309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.1369790Z def test_silu_mul_quant( 2025-05-07T20:33:42.1370034Z self, 2025-05-07T20:33:42.1370239Z T: int, 2025-05-07T20:33:42.1370438Z D: int, 2025-05-07T20:33:42.1370661Z scale_ub: Optional[float], 2025-05-07T20:33:42.1370944Z contiguous: bool, 2025-05-07T20:33:42.1371197Z compiled: bool, 2025-05-07T20:33:42.1371417Z ) -> None: 2025-05-07T20:33:42.1371631Z torch.manual_seed(2025) 2025-05-07T20:33:42.1371873Z 2025-05-07T20:33:42.1372144Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.1372498Z 2025-05-07T20:33:42.1372699Z x_sign = torch.sign(x) 2025-05-07T20:33:42.1372984Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.1373303Z x = x_sign * x_clamp 2025-05-07T20:33:42.1373547Z x0 = x[:, :D] 2025-05-07T20:33:42.1373760Z x1 = x[:, D:] 2025-05-07T20:33:42.1373972Z 2025-05-07T20:33:42.1374161Z if contiguous: 2025-05-07T20:33:42.1374387Z x0 = x0.contiguous() 2025-05-07T20:33:42.1374738Z x1 = x1.contiguous() 2025-05-07T20:33:42.1374985Z 2025-05-07T20:33:42.1375186Z if scale_ub is not None: 2025-05-07T20:33:42.1375460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.1375804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.1376172Z ) 2025-05-07T20:33:42.1376360Z else: 2025-05-07T20:33:42.1376569Z scale_ub_tensor = None 2025-05-07T20:33:42.1376824Z 2025-05-07T20:33:42.1377047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1377414Z op = silu_mul_quant 2025-05-07T20:33:42.1377672Z if compiled: 2025-05-07T20:33:42.1377915Z op = torch.compile(op) 2025-05-07T20:33:42.1378218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1378542Z 2025-05-07T20:33:42.1378727Z y_fp8, y_scale = fn() 2025-05-07T20:33:42.1379013Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:42.1379311Z 2025-05-07T20:33:42.1379546Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1379890Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:42.1380191Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:42.1380516Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:42.1380875Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:42.1381191Z 2025-05-07T20:33:42.1381396Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:42.1381593Z 2025-05-07T20:33:42.1381693Z moe/activation_test.py:126: 2025-05-07T20:33:42.1381994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1382386Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:42.1382712Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:42.1383537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:42.1384333Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:42.1384909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.1385624Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.1386352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:42.1387116Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:42.1387945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:42.1388617Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:42.1389250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:42.1389800Z fn() 2025-05-07T20:33:42.1390331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:42.1390950Z self.fn.run( 2025-05-07T20:33:42.1391445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.1392005Z kernel = self.compile( 2025-05-07T20:33:42.1392563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.1393259Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.1393679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1393922Z 2025-05-07T20:33:42.1394147Z self = 2025-05-07T20:33:42.1395271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.1396696Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0aebd8a0>} 2025-05-07T20:33:42.1398200Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.1399293Z context = 2025-05-07T20:33:42.1399595Z 2025-05-07T20:33:42.1399767Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.1400317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.1400845Z module_map=module_map) 2025-05-07T20:33:42.1401225Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.1401592Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:42.1401876Z E ^ 2025-05-07T20:33:42.1402358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.1402833Z 2025-05-07T20:33:42.1403269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.1403818Z 2025-05-07T20:33:42.1403925Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.1404349Z self=, 2025-05-07T20:33:42.1404765Z T=16384, 2025-05-07T20:33:42.1405001Z D=7168, 2025-05-07T20:33:42.1405201Z scale_ub=1200.0, 2025-05-07T20:33:42.1405430Z contiguous=False, 2025-05-07T20:33:42.1405651Z compiled=False, 2025-05-07T20:33:42.1405858Z ) 2025-05-07T20:33:42.8839355Z self = 2025-05-07T20:33:42.8840221Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:42.8840661Z 2025-05-07T20:33:42.8840753Z @given( 2025-05-07T20:33:42.8841034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.8841350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.8841674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.8842025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.8842382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.8842682Z ) 2025-05-07T20:33:42.8843057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.8843530Z def test_silu_mul_quant( 2025-05-07T20:33:42.8843787Z self, 2025-05-07T20:33:42.8843993Z T: int, 2025-05-07T20:33:42.8844201Z D: int, 2025-05-07T20:33:42.8844425Z scale_ub: Optional[float], 2025-05-07T20:33:42.8844712Z contiguous: bool, 2025-05-07T20:33:42.8844960Z compiled: bool, 2025-05-07T20:33:42.8845190Z ) -> None: 2025-05-07T20:33:42.8845409Z torch.manual_seed(2025) 2025-05-07T20:33:42.8845653Z 2025-05-07T20:33:42.8845924Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.8846279Z 2025-05-07T20:33:42.8846482Z x_sign = torch.sign(x) 2025-05-07T20:33:42.8846770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.8847088Z x = x_sign * x_clamp 2025-05-07T20:33:42.8847326Z x0 = x[:, :D] 2025-05-07T20:33:42.8847540Z x1 = x[:, D:] 2025-05-07T20:33:42.8847741Z 2025-05-07T20:33:42.8847941Z if contiguous: 2025-05-07T20:33:42.8848211Z x0 = x0.contiguous() 2025-05-07T20:33:42.8848472Z x1 = x1.contiguous() 2025-05-07T20:33:42.8848713Z 2025-05-07T20:33:42.8848908Z if scale_ub is not None: 2025-05-07T20:33:42.8849175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.8849515Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.8849829Z ) 2025-05-07T20:33:42.8850016Z else: 2025-05-07T20:33:42.8850544Z scale_ub_tensor = None 2025-05-07T20:33:42.8850809Z 2025-05-07T20:33:42.8851036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.8851355Z op = silu_mul_quant 2025-05-07T20:33:42.8851605Z if compiled: 2025-05-07T20:33:42.8851935Z op = torch.compile(op) 2025-05-07T20:33:42.8852246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.8852533Z 2025-05-07T20:33:42.8852732Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.8852902Z 2025-05-07T20:33:42.8853007Z moe/activation_test.py:117: 2025-05-07T20:33:42.8853395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.8853747Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.8854035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.8854881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:42.8855626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.8856188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.8856915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.8857622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.8858241Z kernel = self.compile( 2025-05-07T20:33:42.8858889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.8859593Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.8860008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.8860252Z 2025-05-07T20:33:42.8860475Z self = 2025-05-07T20:33:42.8861605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.8863061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c09d487c0>} 2025-05-07T20:33:42.8864479Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.8865576Z context = 2025-05-07T20:33:42.8865879Z 2025-05-07T20:33:42.8866058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.8866595Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.8867089Z module_map=module_map) 2025-05-07T20:33:42.8867465Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.8867823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.8868096Z E ^ 2025-05-07T20:33:42.8868583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.8869055Z 2025-05-07T20:33:42.8869502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.8870046Z 2025-05-07T20:33:42.8870152Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.8870584Z self=, 2025-05-07T20:33:42.8871006Z T=1, 2025-05-07T20:33:42.8871188Z D=7168, 2025-05-07T20:33:42.8871383Z scale_ub=None, 2025-05-07T20:33:42.8871597Z contiguous=True, 2025-05-07T20:33:42.8871875Z compiled=True, 2025-05-07T20:33:42.8872087Z ) 2025-05-07T20:33:42.8872420Z self = 2025-05-07T20:33:42.8872918Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:42.8873196Z 2025-05-07T20:33:42.8873277Z @given( 2025-05-07T20:33:42.8873576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.8873895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.8874204Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.8874605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.8874941Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.8875232Z ) 2025-05-07T20:33:42.8875588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.8876049Z def test_silu_mul_quant( 2025-05-07T20:33:42.8876294Z self, 2025-05-07T20:33:42.8876493Z T: int, 2025-05-07T20:33:42.8876697Z D: int, 2025-05-07T20:33:42.8876913Z scale_ub: Optional[float], 2025-05-07T20:33:42.8877192Z contiguous: bool, 2025-05-07T20:33:42.8877453Z compiled: bool, 2025-05-07T20:33:42.8885126Z ) -> None: 2025-05-07T20:33:42.8885377Z torch.manual_seed(2025) 2025-05-07T20:33:42.8885635Z 2025-05-07T20:33:42.8885927Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.8886297Z 2025-05-07T20:33:42.8886579Z x_sign = torch.sign(x) 2025-05-07T20:33:42.8886887Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.8887220Z x = x_sign * x_clamp 2025-05-07T20:33:42.8887480Z x0 = x[:, :D] 2025-05-07T20:33:42.8887749Z x1 = x[:, D:] 2025-05-07T20:33:42.8887968Z 2025-05-07T20:33:42.8888158Z if contiguous: 2025-05-07T20:33:42.8888408Z x0 = x0.contiguous() 2025-05-07T20:33:42.8888685Z x1 = x1.contiguous() 2025-05-07T20:33:42.8888943Z 2025-05-07T20:33:42.8889139Z if scale_ub is not None: 2025-05-07T20:33:42.8889428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.8889780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.8890094Z ) 2025-05-07T20:33:42.8890304Z else: 2025-05-07T20:33:42.8890526Z scale_ub_tensor = None 2025-05-07T20:33:42.8890785Z 2025-05-07T20:33:42.8891028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.8891364Z op = silu_mul_quant 2025-05-07T20:33:42.8891622Z if compiled: 2025-05-07T20:33:42.8891882Z op = torch.compile(op) 2025-05-07T20:33:42.8892198Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.8892479Z 2025-05-07T20:33:42.8892684Z y_fp8, y_scale = fn() 2025-05-07T20:33:42.8892984Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:42.8893288Z 2025-05-07T20:33:42.8893528Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.8893883Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:42.8894191Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:42.8894510Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:42.8894962Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:42.8895326Z 2025-05-07T20:33:42.8895603Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:42.8895882Z 2025-05-07T20:33:42.8895992Z moe/activation_test.py:126: 2025-05-07T20:33:42.8896301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.8896650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:42.8896980Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:42.8897804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:42.8898712Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:42.8899275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.8899989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.8900750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:42.8901524Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:42.8902327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:42.8903003Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:42.8903638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:42.8904191Z fn() 2025-05-07T20:33:42.8904723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:42.8905355Z self.fn.run( 2025-05-07T20:33:42.8905853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.8906415Z kernel = self.compile( 2025-05-07T20:33:42.8906990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.8907730Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.8908143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.8908398Z 2025-05-07T20:33:42.8908612Z self = 2025-05-07T20:33:42.8909747Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.8911186Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c09d7a840>} 2025-05-07T20:33:42.8912609Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.8913695Z context = 2025-05-07T20:33:42.8914007Z 2025-05-07T20:33:42.8914180Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.8914735Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.8915227Z module_map=module_map) 2025-05-07T20:33:42.8915599Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.8915979Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:42.8916268Z E ^ 2025-05-07T20:33:42.8916746Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.8917227Z 2025-05-07T20:33:42.8917669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:42.8918331Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:117, fn -> silu_mul_quant, activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
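Every failing example in this run dies at the same point: Triton rejects the fp8e4nv (float8 e4m3) dtype while lowering the kernel to TTIR. In this Triton build, e4m3 lowering needs an sm_89+ GPU (Ada/Hopper); the g5.4xlarge runner carries an A10G (sm_86), where only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal sketch of a capability check such tests could skip on (the helper name is illustrative, not part of the suite):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) lowers only on compute capability >= 8.9
        # (Ada/Hopper); an A10G reports (8, 6), matching the error above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)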
2025-05-07T20:33:43.6918062Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:117, fn -> silu_mul_quant, activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
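For reference, the computation under test is compact: fn() calls the fused silu_mul_quant op, while ref_fn() recomputes the activation in fp32 and then quantizes it rowwise. A plain-PyTorch restatement of the activation half, taken directly from the test body above:

    import torch

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # y = silu(x0) * x1 = x0 * sigmoid(x0) * x1, computed in fp32 as in ref_fn()
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32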
2025-05-07T20:33:43.6950430Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
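When compiled=True, fn() gets through and the failure moves one step later, into triton_quantize_fp8_row, whose _kernel_quantize_fp8_row kernel also materializes an e4m3 tensor. The rowwise scheme itself is simple: one scale per row, optionally clamped by scale_ub, then divide and cast. A hedged pure-PyTorch sketch of that idea (448 is the float8_e4m3fn finite max; this illustrates the scheme, not the FBGEMM kernel):

    import torch
    from typing import Optional, Tuple

    E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # one scale per row, from the row's max magnitude
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The test's check then dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], the inverse of the division above.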
2025-05-07T20:33:43.7562452Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:117, fn -> silu_mul_quant, activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:43.9593462Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:117, fn -> silu_mul_quant, activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
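Because each Triton kernel is JIT-compiled on first call, every Hypothesis example re-raises the same CompilationError instead of the suite failing once. One hedged way to fail fast on unsupported hardware (illustrative; not the suite's actual gating) is to reuse the capability check sketched earlier as a skip condition:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:  # as sketched above
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89+; only fp8e4b15/fp8e5 here")
    class GatedActivationTests(unittest.TestCase):
        pass  # test_silu_mul_quant would live here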
2025-05-07T20:33:43.9624921Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.3442968Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.7110006Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
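The contiguous flag changes more than layout: x0 = x[:, :D] is a view whose row stride is 2*D, while x0.contiguous() copies it down to row stride D. torch.compile guards on strides, so alternating between the two forms forces recompiles; that is what the recompile_limit warning below means by "stride mismatch at index 0. expected 5120, actual 10240" (D vs 2*D for D=5120). A tiny demonstration:

    import torch

    T, D = 4, 5120
    x = torch.randn(T, 2 * D)
    x0_view = x[:, :D]              # view into x: stride (10240, 1)
    x0_copy = x0_view.contiguous()  # compact copy: stride (5120, 1)
    print(x0_view.stride(), x0_copy.stride())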
2025-05-07T20:33:45.1397329Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:45.5717019Z 2025-05-07T20:33:45.5717452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:45.5718053Z 2025-05-07T20:33:45.5718159Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:45.5718590Z self=, 2025-05-07T20:33:45.5719008Z T=16384, 2025-05-07T20:33:45.5719204Z D=5120, 2025-05-07T20:33:45.5719457Z scale_ub=None, 2025-05-07T20:33:45.5719675Z contiguous=True, 2025-05-07T20:33:45.5719905Z compiled=True, 2025-05-07T20:33:45.5720120Z ) 2025-05-07T20:33:45.5979646Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:45.5982826Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:45.5985632Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:45.5987711Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:45.5989327Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:33:45.6863015Z self = 2025-05-07T20:33:45.6863811Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:45.6864214Z 2025-05-07T20:33:45.6864327Z @given( 2025-05-07T20:33:45.6864659Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:45.6865023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:45.6865357Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:45.6865718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:45.6866074Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:45.6866376Z ) 2025-05-07T20:33:45.6866748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:45.6867224Z def test_silu_mul_quant( 2025-05-07T20:33:45.6867484Z self, 2025-05-07T20:33:45.6867698Z T: int, 2025-05-07T20:33:45.6867917Z D: int, 2025-05-07T20:33:45.6868153Z scale_ub: Optional[float], 2025-05-07T20:33:45.6868442Z contiguous: bool, 2025-05-07T20:33:45.6868699Z compiled: bool, 2025-05-07T20:33:45.6868947Z ) -> None: 2025-05-07T20:33:45.6869174Z torch.manual_seed(2025) 2025-05-07T20:33:45.6869418Z 2025-05-07T20:33:45.6869704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:45.6870066Z 2025-05-07T20:33:45.6870263Z x_sign = torch.sign(x) 2025-05-07T20:33:45.6870566Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:45.6870896Z x = x_sign * x_clamp 2025-05-07T20:33:45.6871139Z x0 = x[:, :D] 2025-05-07T20:33:45.6871362Z x1 = x[:, D:] 2025-05-07T20:33:45.6871577Z 2025-05-07T20:33:45.6871766Z if contiguous: 2025-05-07T20:33:45.6872011Z x0 = x0.contiguous() 2025-05-07T20:33:45.6872283Z x1 = x1.contiguous() 2025-05-07T20:33:45.6872532Z 2025-05-07T20:33:45.6872733Z if scale_ub is not None: 2025-05-07T20:33:45.6873021Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:45.6873365Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:33:45.6873689Z ) 2025-05-07T20:33:45.6873896Z else: 2025-05-07T20:33:45.6874117Z scale_ub_tensor = None 2025-05-07T20:33:45.6874375Z 2025-05-07T20:33:45.6874617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:45.6874948Z op = silu_mul_quant 2025-05-07T20:33:45.6875290Z if compiled: 2025-05-07T20:33:45.6875548Z op = torch.compile(op) 2025-05-07T20:33:45.6875860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:45.6876152Z 2025-05-07T20:33:45.6876355Z y_fp8, y_scale = fn() 2025-05-07T20:33:45.6876746Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:45.6877048Z 2025-05-07T20:33:45.6877302Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:45.6877654Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:45.6877959Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:45.6878354Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:45.6878730Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:45.6879055Z 2025-05-07T20:33:45.6879258Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:45.6879468Z 2025-05-07T20:33:45.6879572Z moe/activation_test.py:126: 2025-05-07T20:33:45.6879884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:45.6880232Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:45.6880572Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:45.6881406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:45.6882206Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:45.6882820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:45.6883542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:45.6884272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:45.6885031Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:45.6885817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:45.6886507Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:45.6887149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:45.6887704Z fn() 2025-05-07T20:33:45.6888240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:45.6888862Z self.fn.run( 2025-05-07T20:33:45.6889361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:45.6889919Z kernel = self.compile( 2025-05-07T20:33:45.6890489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:45.6891178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:45.6891595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:45.6891838Z 2025-05-07T20:33:45.6892051Z self = 2025-05-07T20:33:45.6893191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:45.6894803Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1df41580>} 2025-05-07T20:33:45.6896227Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:45.6897311Z context = 2025-05-07T20:33:45.6897672Z 2025-05-07T20:33:45.6897849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:45.6898434Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:45.6898997Z module_map=module_map) 2025-05-07T20:33:45.6899379Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:45.6899758Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:45.6900050Z E ^ 2025-05-07T20:33:45.6900536Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:45.6901060Z 2025-05-07T20:33:45.6901499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:45.6902051Z 2025-05-07T20:33:45.6902161Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:45.6902594Z self=, 2025-05-07T20:33:45.6903020Z T=1, 2025-05-07T20:33:45.6903221Z D=5120, 2025-05-07T20:33:45.6903424Z scale_ub=1200.0, 2025-05-07T20:33:45.6903646Z contiguous=True, 2025-05-07T20:33:45.6903872Z compiled=True, 2025-05-07T20:33:45.6904081Z ) 2025-05-07T20:33:45.8351592Z self = 2025-05-07T20:33:45.8353026Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:45.8353588Z 2025-05-07T20:33:45.8353753Z @given( 2025-05-07T20:33:45.8354213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:45.8354852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:45.8355461Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:45.8356130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:45.8356795Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:45.8357367Z ) 2025-05-07T20:33:45.8358074Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:45.8358829Z def test_silu_mul_quant( 2025-05-07T20:33:45.8359091Z self, 2025-05-07T20:33:45.8359289Z T: int, 2025-05-07T20:33:45.8359494Z D: int, 2025-05-07T20:33:45.8359721Z scale_ub: Optional[float], 2025-05-07T20:33:45.8359992Z contiguous: bool, 2025-05-07T20:33:45.8360238Z compiled: bool, 2025-05-07T20:33:45.8360467Z ) -> None: 2025-05-07T20:33:45.8360682Z torch.manual_seed(2025) 2025-05-07T20:33:45.8360930Z 2025-05-07T20:33:45.8361215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:45.8361564Z 2025-05-07T20:33:45.8361763Z x_sign = torch.sign(x) 2025-05-07T20:33:45.8362061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:45.8362375Z x = x_sign * x_clamp 2025-05-07T20:33:45.8362619Z x0 = x[:, :D] 2025-05-07T20:33:45.8362842Z x1 = x[:, D:] 2025-05-07T20:33:45.8363045Z 2025-05-07T20:33:45.8363240Z if contiguous: 2025-05-07T20:33:45.8363483Z x0 = x0.contiguous() 2025-05-07T20:33:45.8363741Z x1 = x1.contiguous() 2025-05-07T20:33:45.8363989Z 2025-05-07T20:33:45.8364188Z if scale_ub is not None: 2025-05-07T20:33:45.8364462Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:45.8364804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:45.8365123Z ) 2025-05-07T20:33:45.8365329Z else: 2025-05-07T20:33:45.8365541Z scale_ub_tensor = 
None 2025-05-07T20:33:45.8365800Z 2025-05-07T20:33:45.8366034Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:45.8366352Z op = silu_mul_quant 2025-05-07T20:33:45.8366607Z if compiled: 2025-05-07T20:33:45.8366860Z op = torch.compile(op) 2025-05-07T20:33:45.8367162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:45.8367518Z 2025-05-07T20:33:45.8367716Z > y_fp8, y_scale = fn() 2025-05-07T20:33:45.8367882Z 2025-05-07T20:33:45.8367983Z moe/activation_test.py:117: 2025-05-07T20:33:45.8368289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:45.8368693Z moe/activation_test.py:115: in fn 2025-05-07T20:33:45.8368986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:45.8369569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:45.8370231Z return fn(*args, **kwargs) 2025-05-07T20:33:45.8370930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:45.8371661Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:45.8372237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:45.8372970Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:45.8373678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:45.8374239Z kernel = self.compile( 2025-05-07T20:33:45.8374892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:45.8375645Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:45.8376062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:45.8376317Z 2025-05-07T20:33:45.8376533Z self = 2025-05-07T20:33:45.8377669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:45.8379165Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da20680>} 2025-05-07T20:33:45.8380593Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:45.8381687Z context = 2025-05-07T20:33:45.8382001Z 2025-05-07T20:33:45.8382174Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:45.8382722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:45.8383216Z module_map=module_map) 2025-05-07T20:33:45.8383592Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:45.8383968Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:45.8384242Z E ^ 2025-05-07T20:33:45.8384725Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:45.8385207Z 2025-05-07T20:33:45.8385654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:45.8386209Z 2025-05-07T20:33:45.8386318Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:45.8386756Z self=, 2025-05-07T20:33:45.8387180Z T=1, 2025-05-07T20:33:45.8387380Z D=5120, 2025-05-07T20:33:45.8387586Z scale_ub=None, 2025-05-07T20:33:45.8387807Z contiguous=False, 2025-05-07T20:33:45.8388040Z compiled=True, 2025-05-07T20:33:45.8388251Z ) 2025-05-07T20:33:46.0788171Z self = 2025-05-07T20:33:46.0788825Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:46.0789417Z 2025-05-07T20:33:46.0789537Z @given( 2025-05-07T20:33:46.0789850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.0790192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.0790580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.0790916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.0791257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.0791552Z ) 2025-05-07T20:33:46.0791903Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.0792424Z def test_silu_mul_quant( 2025-05-07T20:33:46.0792674Z self, 2025-05-07T20:33:46.0792864Z T: int, 2025-05-07T20:33:46.0793063Z D: int, 2025-05-07T20:33:46.0793280Z scale_ub: Optional[float], 2025-05-07T20:33:46.0793554Z contiguous: bool, 2025-05-07T20:33:46.0793796Z compiled: bool, 2025-05-07T20:33:46.0794023Z ) -> None: 2025-05-07T20:33:46.0794242Z torch.manual_seed(2025) 2025-05-07T20:33:46.0794485Z 2025-05-07T20:33:46.0794763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.0795123Z 2025-05-07T20:33:46.0795313Z x_sign = torch.sign(x) 2025-05-07T20:33:46.0795610Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.0795928Z x = x_sign * x_clamp 2025-05-07T20:33:46.0803636Z x0 = x[:, :D] 2025-05-07T20:33:46.0803890Z x1 = x[:, D:] 2025-05-07T20:33:46.0804118Z 2025-05-07T20:33:46.0804319Z if contiguous: 2025-05-07T20:33:46.0804569Z x0 = x0.contiguous() 2025-05-07T20:33:46.0804834Z x1 = x1.contiguous() 2025-05-07T20:33:46.0805085Z 2025-05-07T20:33:46.0805281Z if scale_ub is not None: 2025-05-07T20:33:46.0805557Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.0805904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.0806233Z ) 2025-05-07T20:33:46.0806422Z else: 2025-05-07T20:33:46.0806638Z scale_ub_tensor = None 2025-05-07T20:33:46.0806903Z 2025-05-07T20:33:46.0807137Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.0807471Z op = silu_mul_quant 2025-05-07T20:33:46.0807733Z if compiled: 2025-05-07T20:33:46.0807985Z op = torch.compile(op) 2025-05-07T20:33:46.0808300Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.0808584Z 2025-05-07T20:33:46.0808776Z y_fp8, y_scale = fn() 2025-05-07T20:33:46.0809081Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:46.0809392Z 2025-05-07T20:33:46.0809642Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.0809981Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:46.0810284Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:46.0810613Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:46.0810976Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:46.0811304Z 2025-05-07T20:33:46.0811511Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:46.0811716Z 2025-05-07T20:33:46.0811823Z moe/activation_test.py:126: 2025-05-07T20:33:46.0812158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.0812510Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:46.0812854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:46.0813680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:46.0814479Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:46.0815223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.0816008Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.0816746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:46.0817555Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:46.0818334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:46.0819020Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:46.0819694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:46.0820248Z fn() 2025-05-07T20:33:46.0820789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:46.0821409Z self.fn.run( 2025-05-07T20:33:46.0821909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.0822480Z kernel = self.compile( 2025-05-07T20:33:46.0823058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.0823755Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.0824182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.0824425Z 2025-05-07T20:33:46.0824692Z self = 2025-05-07T20:33:46.0826179Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.0827691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da22b60>} 2025-05-07T20:33:46.0829112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.0830214Z context = 2025-05-07T20:33:46.0830528Z 2025-05-07T20:33:46.0830709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.0831264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.0831759Z module_map=module_map) 2025-05-07T20:33:46.0832144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.0832517Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:46.0832791Z E ^ 2025-05-07T20:33:46.0833281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.0833767Z 2025-05-07T20:33:46.0834206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.0834754Z 2025-05-07T20:33:46.0834866Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.0835292Z self=, 2025-05-07T20:33:46.0835711Z T=1, 2025-05-07T20:33:46.0835911Z D=5120, 2025-05-07T20:33:46.0836107Z scale_ub=None, 2025-05-07T20:33:46.0836327Z contiguous=True, 2025-05-07T20:33:46.0836559Z compiled=False, 2025-05-07T20:33:46.0836764Z ) 2025-05-07T20:33:46.2336594Z self = 2025-05-07T20:33:46.2337305Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:46.2337678Z 2025-05-07T20:33:46.2337788Z @given( 2025-05-07T20:33:46.2338104Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.2338561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.2338875Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.2339204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.2339605Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.2339908Z ) 2025-05-07T20:33:46.2340264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.2340724Z def test_silu_mul_quant( 2025-05-07T20:33:46.2340970Z self, 2025-05-07T20:33:46.2341221Z T: int, 2025-05-07T20:33:46.2341421Z D: int, 2025-05-07T20:33:46.2341639Z scale_ub: Optional[float], 2025-05-07T20:33:46.2341907Z contiguous: bool, 2025-05-07T20:33:46.2342152Z compiled: bool, 2025-05-07T20:33:46.2342375Z ) -> None: 2025-05-07T20:33:46.2342587Z torch.manual_seed(2025) 2025-05-07T20:33:46.2342833Z 2025-05-07T20:33:46.2343111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.2343460Z 2025-05-07T20:33:46.2343646Z x_sign = torch.sign(x) 2025-05-07T20:33:46.2343936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.2344250Z x = x_sign * x_clamp 2025-05-07T20:33:46.2344491Z x0 = x[:, :D] 2025-05-07T20:33:46.2344707Z x1 = x[:, D:] 2025-05-07T20:33:46.2344917Z 2025-05-07T20:33:46.2345095Z if contiguous: 2025-05-07T20:33:46.2345390Z x0 = x0.contiguous() 2025-05-07T20:33:46.2345653Z x1 = x1.contiguous() 2025-05-07T20:33:46.2345896Z 2025-05-07T20:33:46.2346090Z if scale_ub is not None: 2025-05-07T20:33:46.2346368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.2346700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.2347013Z ) 2025-05-07T20:33:46.2347210Z else: 2025-05-07T20:33:46.2347426Z scale_ub_tensor = None 2025-05-07T20:33:46.2347693Z 2025-05-07T20:33:46.2347927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.2348246Z op = silu_mul_quant 2025-05-07T20:33:46.2348492Z if compiled: 2025-05-07T20:33:46.2348739Z op = torch.compile(op) 2025-05-07T20:33:46.2349038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.2349311Z 2025-05-07T20:33:46.2349504Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.2349667Z 2025-05-07T20:33:46.2349775Z moe/activation_test.py:117: 2025-05-07T20:33:46.2350063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.2350406Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.2350694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.2351407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.2352130Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.2352690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.2353402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.2354091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.2354650Z kernel = self.compile( 2025-05-07T20:33:46.2355215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.2355905Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.2356303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.2356542Z 2025-05-07T20:33:46.2356748Z self = 2025-05-07T20:33:46.2357873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.2359397Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da239c0>} 2025-05-07T20:33:46.2360809Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.2361936Z context = 2025-05-07T20:33:46.2362243Z 2025-05-07T20:33:46.2362413Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.2362957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.2363447Z module_map=module_map) 2025-05-07T20:33:46.2363822Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.2364193Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.2364458Z E ^ 2025-05-07T20:33:46.2364942Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.2365418Z 2025-05-07T20:33:46.2365899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.2366448Z 2025-05-07T20:33:46.2366559Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.2366982Z self=, 2025-05-07T20:33:46.2367400Z T=128, 2025-05-07T20:33:46.2367594Z D=5120, 2025-05-07T20:33:46.2367789Z scale_ub=None, 2025-05-07T20:33:46.2368010Z contiguous=False, 2025-05-07T20:33:46.2368238Z compiled=True, 2025-05-07T20:33:46.2368448Z ) 2025-05-07T20:33:46.2368769Z self = 2025-05-07T20:33:46.2369285Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:46.2369569Z 2025-05-07T20:33:46.2369655Z @given( 2025-05-07T20:33:46.2369888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.2370217Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.2370536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.2370870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.2371220Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.2371516Z ) 2025-05-07T20:33:46.2371874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.2372334Z def test_silu_mul_quant( 2025-05-07T20:33:46.2372588Z self, 2025-05-07T20:33:46.2372788Z T: int, 2025-05-07T20:33:46.2372975Z D: int, 2025-05-07T20:33:46.2373197Z scale_ub: Optional[float], 2025-05-07T20:33:46.2373483Z contiguous: bool, 2025-05-07T20:33:46.2373724Z compiled: bool, 2025-05-07T20:33:46.2373950Z ) -> None: 2025-05-07T20:33:46.2374166Z torch.manual_seed(2025) 2025-05-07T20:33:46.2374402Z 2025-05-07T20:33:46.2374812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.2375167Z 2025-05-07T20:33:46.2375354Z x_sign = torch.sign(x) 2025-05-07T20:33:46.2375648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.2375967Z x = x_sign * x_clamp 2025-05-07T20:33:46.2376204Z x0 = x[:, :D] 2025-05-07T20:33:46.2376420Z x1 = x[:, D:] 2025-05-07T20:33:46.2376628Z 2025-05-07T20:33:46.2376806Z if contiguous: 2025-05-07T20:33:46.2377034Z x0 = x0.contiguous() 2025-05-07T20:33:46.2377293Z x1 = x1.contiguous() 2025-05-07T20:33:46.2377530Z 2025-05-07T20:33:46.2377776Z if scale_ub is not None: 2025-05-07T20:33:46.2378051Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.2378384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.2378698Z ) 2025-05-07T20:33:46.2378884Z else: 2025-05-07T20:33:46.2379156Z scale_ub_tensor = None 2025-05-07T20:33:46.2379408Z 2025-05-07T20:33:46.2379638Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.2379961Z op = silu_mul_quant 2025-05-07T20:33:46.2380205Z if compiled: 2025-05-07T20:33:46.2380548Z op = torch.compile(op) 2025-05-07T20:33:46.2380846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.2381120Z 2025-05-07T20:33:46.2381309Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.2381472Z 2025-05-07T20:33:46.2381572Z moe/activation_test.py:117: 2025-05-07T20:33:46.2381859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.2382200Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.2382481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.2383074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:46.2383652Z return fn(*args, **kwargs) 
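Every failing example above reduces to the same root cause: both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant ask Triton for the fp8e4nv dtype (FP8 E4M3), and Triton only lowers that type on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge runner carries an A10G at SM 8.6, where Triton exposes only fp8e4b15 and fp8e5, hence the repeated ValueError. A minimal capability gate for tests like this one could look as follows (a sketch only; supports_fp8e4nv is a hypothetical helper, not FBGEMM's actual code):

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # get_device_capability() returns (8, 6) on A10G and (9, 0) on H100;
    # Triton's fp8e4nv lowering needs SM 8.9 or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the failing test:
# @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...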
2025-05-07T20:33:46.2384341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.2385107Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.2385670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.2386376Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.2387066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.2387622Z kernel = self.compile( 2025-05-07T20:33:46.2388179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.2388863Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.2389266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.2389497Z 2025-05-07T20:33:46.2389714Z self = 2025-05-07T20:33:46.2390831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.2392255Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da20a40>} 2025-05-07T20:33:46.2393655Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.2394730Z context = 2025-05-07T20:33:46.2395029Z 2025-05-07T20:33:46.2395199Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.2395726Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.2396204Z module_map=module_map) 2025-05-07T20:33:46.2396570Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.2396921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.2397181Z E ^ 2025-05-07T20:33:46.2397659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.2398126Z 2025-05-07T20:33:46.2398562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.2399152Z 2025-05-07T20:33:46.2399257Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.2399680Z self=, 2025-05-07T20:33:46.2400136Z T=128, 2025-05-07T20:33:46.2400317Z D=7168, 2025-05-07T20:33:46.2400514Z scale_ub=1200.0, 2025-05-07T20:33:46.2400738Z contiguous=False, 2025-05-07T20:33:46.2400969Z compiled=False, 2025-05-07T20:33:46.2401178Z ) 2025-05-07T20:33:46.3541545Z self = 2025-05-07T20:33:46.3542364Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:46.3542805Z 2025-05-07T20:33:46.3542914Z @given( 2025-05-07T20:33:46.3543232Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.3543648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.3543969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.3544296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.3544627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.3544915Z ) 2025-05-07T20:33:46.3545265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.3545723Z def test_silu_mul_quant( 2025-05-07T20:33:46.3545962Z self, 2025-05-07T20:33:46.3546150Z T: int, 2025-05-07T20:33:46.3546461Z D: int, 2025-05-07T20:33:46.3546687Z scale_ub: Optional[float], 2025-05-07T20:33:46.3546968Z contiguous: bool, 2025-05-07T20:33:46.3547212Z compiled: bool, 2025-05-07T20:33:46.3547438Z ) -> None: 2025-05-07T20:33:46.3547646Z torch.manual_seed(2025) 2025-05-07T20:33:46.3547885Z 2025-05-07T20:33:46.3548157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.3548513Z 2025-05-07T20:33:46.3548705Z x_sign = torch.sign(x) 2025-05-07T20:33:46.3549045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.3549363Z x = x_sign * x_clamp 2025-05-07T20:33:46.3549597Z x0 = x[:, :D] 2025-05-07T20:33:46.3549812Z x1 = x[:, D:] 2025-05-07T20:33:46.3550015Z 2025-05-07T20:33:46.3550196Z if contiguous: 2025-05-07T20:33:46.3550426Z x0 = x0.contiguous() 2025-05-07T20:33:46.3550687Z x1 = x1.contiguous() 2025-05-07T20:33:46.3550927Z 2025-05-07T20:33:46.3551112Z if scale_ub is not None: 2025-05-07T20:33:46.3551388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.3551719Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.3552030Z ) 2025-05-07T20:33:46.3552221Z else: 2025-05-07T20:33:46.3552424Z scale_ub_tensor = None 2025-05-07T20:33:46.3552681Z 2025-05-07T20:33:46.3552909Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.3553222Z op = silu_mul_quant 2025-05-07T20:33:46.3553474Z if compiled: 2025-05-07T20:33:46.3553721Z op = torch.compile(op) 2025-05-07T20:33:46.3554025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.3554296Z 2025-05-07T20:33:46.3554493Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.3554657Z 2025-05-07T20:33:46.3554758Z moe/activation_test.py:117: 2025-05-07T20:33:46.3555052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.3555410Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.3555693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.3556415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.3557133Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.3557693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.3558485Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.3559235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.3559789Z kernel = self.compile( 2025-05-07T20:33:46.3560415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.3561113Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.3561583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.3561820Z 2025-05-07T20:33:46.3562031Z self = 2025-05-07T20:33:46.3563155Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.3564585Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c08538a40>} 2025-05-07T20:33:46.3565994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.3567113Z context = 2025-05-07T20:33:46.3567422Z 2025-05-07T20:33:46.3567591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.3568134Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.3568619Z module_map=module_map) 2025-05-07T20:33:46.3568986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.3569354Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.3569624Z E ^ 2025-05-07T20:33:46.3570103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.3570580Z 2025-05-07T20:33:46.3571019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.3571567Z 2025-05-07T20:33:46.3571675Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.3572106Z self=, 2025-05-07T20:33:46.3572526Z T=128, 2025-05-07T20:33:46.3572711Z D=5120, 2025-05-07T20:33:46.3572900Z scale_ub=None, 2025-05-07T20:33:46.3573110Z contiguous=False, 2025-05-07T20:33:46.3573335Z compiled=False, 2025-05-07T20:33:46.3573537Z ) 2025-05-07T20:33:46.3573851Z self = 2025-05-07T20:33:46.3574361Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:46.3574799Z 2025-05-07T20:33:46.3574883Z @given( 2025-05-07T20:33:46.3575110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.3575421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.3575733Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.3576131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.3576500Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.3576785Z ) 2025-05-07T20:33:46.3577142Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.3577593Z def test_silu_mul_quant( 2025-05-07T20:33:46.3577842Z self, 2025-05-07T20:33:46.3578034Z T: int, 2025-05-07T20:33:46.3578222Z D: int, 2025-05-07T20:33:46.3578439Z scale_ub: Optional[float], 2025-05-07T20:33:46.3578709Z contiguous: bool, 2025-05-07T20:33:46.3579011Z compiled: bool, 2025-05-07T20:33:46.3579221Z ) -> None: 2025-05-07T20:33:46.3579436Z torch.manual_seed(2025) 2025-05-07T20:33:46.3579677Z 2025-05-07T20:33:46.3579945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.3580296Z 2025-05-07T20:33:46.3580538Z x_sign = torch.sign(x) 2025-05-07T20:33:46.3580828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.3581144Z x = x_sign * x_clamp 2025-05-07T20:33:46.3581380Z x0 = x[:, :D] 2025-05-07T20:33:46.3581627Z x1 = x[:, D:] 2025-05-07T20:33:46.3581829Z 2025-05-07T20:33:46.3582011Z if contiguous: 2025-05-07T20:33:46.3582234Z x0 = x0.contiguous() 2025-05-07T20:33:46.3582493Z x1 = x1.contiguous() 2025-05-07T20:33:46.3582736Z 2025-05-07T20:33:46.3582918Z if scale_ub is not None: 2025-05-07T20:33:46.3583187Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.3583525Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.3583837Z ) 2025-05-07T20:33:46.3584021Z else: 2025-05-07T20:33:46.3584236Z scale_ub_tensor = None 2025-05-07T20:33:46.3584486Z 2025-05-07T20:33:46.3584712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.3585029Z op = silu_mul_quant 2025-05-07T20:33:46.3585276Z if compiled: 2025-05-07T20:33:46.3585515Z op = torch.compile(op) 2025-05-07T20:33:46.3585866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.3586151Z 2025-05-07T20:33:46.3586337Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.3586508Z 2025-05-07T20:33:46.3586604Z moe/activation_test.py:117: 2025-05-07T20:33:46.3586896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.3587224Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.3587504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.3588218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.3588942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.3589498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.3596670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.3597412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.3597986Z kernel = self.compile( 2025-05-07T20:33:46.3598556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.3599241Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.3599665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.3599916Z 2025-05-07T20:33:46.3600127Z self = 2025-05-07T20:33:46.3601253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.3602684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da2c400>} 2025-05-07T20:33:46.3604094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.3605183Z context = 2025-05-07T20:33:46.3605491Z 2025-05-07T20:33:46.3605991Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.3606530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.3607004Z module_map=module_map) 2025-05-07T20:33:46.3607416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.3607784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.3608054Z E ^ 2025-05-07T20:33:46.3608567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.3609122Z 2025-05-07T20:33:46.3609563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.3610108Z 2025-05-07T20:33:46.3610218Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.3610646Z self=, 2025-05-07T20:33:46.3611078Z T=128, 2025-05-07T20:33:46.3611274Z D=5120, 2025-05-07T20:33:46.3611467Z scale_ub=1200.0, 2025-05-07T20:33:46.3611702Z contiguous=True, 2025-05-07T20:33:46.3611934Z compiled=False, 2025-05-07T20:33:46.3612147Z ) 2025-05-07T20:33:46.5344167Z self = 2025-05-07T20:33:46.5345013Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:46.5345449Z 2025-05-07T20:33:46.5345561Z @given( 2025-05-07T20:33:46.5346014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.5346445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.5346842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.5347242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.5347582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.5347878Z ) 2025-05-07T20:33:46.5348236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.5348696Z def test_silu_mul_quant( 2025-05-07T20:33:46.5348939Z self, 2025-05-07T20:33:46.5349138Z T: int, 2025-05-07T20:33:46.5349332Z D: int, 2025-05-07T20:33:46.5349545Z scale_ub: Optional[float], 2025-05-07T20:33:46.5349819Z contiguous: bool, 2025-05-07T20:33:46.5350055Z compiled: bool, 2025-05-07T20:33:46.5350290Z ) -> None: 2025-05-07T20:33:46.5350505Z torch.manual_seed(2025) 2025-05-07T20:33:46.5350760Z 2025-05-07T20:33:46.5351039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.5351391Z 2025-05-07T20:33:46.5351584Z x_sign = torch.sign(x) 2025-05-07T20:33:46.5351884Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.5352197Z x = x_sign * x_clamp 2025-05-07T20:33:46.5352437Z x0 = x[:, :D] 2025-05-07T20:33:46.5352654Z x1 = x[:, D:] 2025-05-07T20:33:46.5352861Z 2025-05-07T20:33:46.5353051Z if contiguous: 2025-05-07T20:33:46.5353285Z x0 = x0.contiguous() 2025-05-07T20:33:46.5353541Z x1 = x1.contiguous() 2025-05-07T20:33:46.5353786Z 2025-05-07T20:33:46.5353979Z if scale_ub is not None: 2025-05-07T20:33:46.5354254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.5354604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.5354922Z ) 2025-05-07T20:33:46.5355108Z else: 2025-05-07T20:33:46.5355326Z scale_ub_tensor = None 2025-05-07T20:33:46.5355581Z 2025-05-07T20:33:46.5355816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.5356137Z op = silu_mul_quant 2025-05-07T20:33:46.5356397Z if compiled: 2025-05-07T20:33:46.5356651Z op = torch.compile(op) 2025-05-07T20:33:46.5356950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.5357226Z 2025-05-07T20:33:46.5357417Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.5357662Z 2025-05-07T20:33:46.5357762Z moe/activation_test.py:117: 2025-05-07T20:33:46.5358064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.5358412Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.5358765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.5359492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.5360219Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.5360835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.5361542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.5362239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.5362795Z kernel = self.compile( 2025-05-07T20:33:46.5363357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.5364036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.5364441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.5364672Z 2025-05-07T20:33:46.5364889Z self = 2025-05-07T20:33:46.5366044Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.5367478Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da2d300>} 2025-05-07T20:33:46.5368888Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.5370029Z context = 2025-05-07T20:33:46.5370329Z 2025-05-07T20:33:46.5370506Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.5371048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.5371530Z module_map=module_map) 2025-05-07T20:33:46.5371910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.5372272Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.5372540Z E ^ 2025-05-07T20:33:46.5373025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.5373497Z 2025-05-07T20:33:46.5373942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.5374483Z 2025-05-07T20:33:46.5374685Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.5375119Z self=, 2025-05-07T20:33:46.5375541Z T=1, 2025-05-07T20:33:46.5375736Z D=7168, 2025-05-07T20:33:46.5375936Z scale_ub=1200.0, 2025-05-07T20:33:46.5376164Z contiguous=True, 2025-05-07T20:33:46.5376395Z compiled=True, 2025-05-07T20:33:46.5376600Z ) 2025-05-07T20:33:46.5376930Z self = 2025-05-07T20:33:46.5377433Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:46.5377704Z 2025-05-07T20:33:46.5377787Z @given( 2025-05-07T20:33:46.5378025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.5378353Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.5378748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.5379103Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.5379445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.5379749Z ) 2025-05-07T20:33:46.5380143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.5380601Z def test_silu_mul_quant( 2025-05-07T20:33:46.5380848Z self, 2025-05-07T20:33:46.5381044Z T: int, 2025-05-07T20:33:46.5381236Z D: int, 2025-05-07T20:33:46.5381456Z scale_ub: Optional[float], 2025-05-07T20:33:46.5381765Z contiguous: bool, 2025-05-07T20:33:46.5382001Z compiled: bool, 2025-05-07T20:33:46.5382223Z ) -> None: 2025-05-07T20:33:46.5382428Z torch.manual_seed(2025) 2025-05-07T20:33:46.5382664Z 2025-05-07T20:33:46.5382939Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.5383286Z 2025-05-07T20:33:46.5383483Z x_sign = torch.sign(x) 2025-05-07T20:33:46.5383773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.5384076Z x = x_sign * x_clamp 2025-05-07T20:33:46.5384320Z x0 = x[:, :D] 2025-05-07T20:33:46.5384529Z x1 = x[:, D:] 2025-05-07T20:33:46.5384729Z 2025-05-07T20:33:46.5384918Z if contiguous: 2025-05-07T20:33:46.5385150Z x0 = x0.contiguous() 2025-05-07T20:33:46.5385409Z x1 = x1.contiguous() 2025-05-07T20:33:46.5385690Z 2025-05-07T20:33:46.5385882Z if scale_ub is not None: 2025-05-07T20:33:46.5386158Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.5386491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.5386805Z ) 2025-05-07T20:33:46.5386999Z else: 2025-05-07T20:33:46.5387210Z scale_ub_tensor = None 2025-05-07T20:33:46.5387469Z 2025-05-07T20:33:46.5387704Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.5388019Z op = silu_mul_quant 2025-05-07T20:33:46.5388270Z if compiled: 2025-05-07T20:33:46.5388521Z op = torch.compile(op) 2025-05-07T20:33:46.5388818Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.5389101Z 2025-05-07T20:33:46.5389293Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.5389457Z 2025-05-07T20:33:46.5389556Z moe/activation_test.py:117: 2025-05-07T20:33:46.5389850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.5390194Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.5390481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.5391053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:46.5391637Z return fn(*args, **kwargs) 
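A secondary issue also surfaces in this run: the W0507 20:33:45.595 warnings above record torch._dynamo giving up on silu_mul_quant after hitting config.recompile_limit (8), because each Hypothesis example changes the strides of x0 (contiguous copies vs. views into the [T, 2*D] buffer) and every new stride set fails the previous guard. Two hedged ways to keep such a sweep compiling, as a sketch under those assumptions (the right fix for this suite may differ):

import torch
import torch._dynamo

# Option 1: raise the recompile ceiling so each example can get its own graph.
torch._dynamo.config.recompile_limit = 64

# Option 2: mark the varying batch dimension dynamic up front so shape
# changes across examples do not invalidate the guard set.
x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
torch._dynamo.mark_dynamic(x0, 0)  # dim 0 (T) varies across examples

The warning is benign for correctness, since dynamo falls back to eager, but it hides compile coverage; TORCH_LOGS="recompiles" (suggested in the log itself) lists every guard failure.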
2025-05-07T20:33:46.5392324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.5393048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.5393600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.5394314Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.5395008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.5395563Z kernel = self.compile( 2025-05-07T20:33:46.5396124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.5396813Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.5397216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.5397449Z 2025-05-07T20:33:46.5397660Z self = 2025-05-07T20:33:46.5398832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.5400295Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da2eac0>} 2025-05-07T20:33:46.5401708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.5402852Z context = 2025-05-07T20:33:46.5403155Z 2025-05-07T20:33:46.5403323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.5403861Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.5404355Z module_map=module_map) 2025-05-07T20:33:46.5404731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.5405096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.5405361Z E ^ 2025-05-07T20:33:46.5405839Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:46.5407344Z 
2025-05-07T20:33:46.5407447Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:46.6760837Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:46.6763211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
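The failure is independent of the FBGEMM kernel body: any Triton program that casts to tl.float8e4nv raises the same CompilationError at make_ir time on this architecture. A standalone repro sketch (an assumption, not taken from the FBGEMM sources; the kernel name is hypothetical) follows:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On a pre-SM-8.9 GPU this cast is what trips
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)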
2025-05-07T20:33:46.6763753Z 
2025-05-07T20:33:46.6763862Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:46.6764416Z     self=,
2025-05-07T20:33:46.6764852Z     T=1,
2025-05-07T20:33:46.6765037Z     D=7168,
2025-05-07T20:33:46.6765225Z     scale_ub=None,
2025-05-07T20:33:46.6765443Z     contiguous=False,
2025-05-07T20:33:46.6765666Z     compiled=True,
2025-05-07T20:33:46.6765913Z )
2025-05-07T20:33:46.7634963Z self = 
2025-05-07T20:33:46.7635675Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:46.7636053Z 
2025-05-07T20:33:46.7636170Z     @given(
2025-05-07T20:33:46.7636395Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:46.7636718Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:46.7637025Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:46.7637356Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:46.7637684Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:46.7637968Z     )
2025-05-07T20:33:46.7638322Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:46.7638773Z     def test_silu_mul_quant(
2025-05-07T20:33:46.7639119Z         self,
2025-05-07T20:33:46.7639313Z         T: int,
2025-05-07T20:33:46.7639501Z         D: int,
2025-05-07T20:33:46.7639715Z         scale_ub: Optional[float],
2025-05-07T20:33:46.7639982Z         contiguous: bool,
2025-05-07T20:33:46.7640211Z         compiled: bool,
2025-05-07T20:33:46.7640427Z     ) -> None:
2025-05-07T20:33:46.7640637Z         torch.manual_seed(2025)
2025-05-07T20:33:46.7640869Z 
2025-05-07T20:33:46.7641138Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:46.7641492Z 
2025-05-07T20:33:46.7641685Z         x_sign = torch.sign(x)
2025-05-07T20:33:46.7641968Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:46.7642283Z         x = x_sign * x_clamp
2025-05-07T20:33:46.7642526Z         x0 = x[:, :D]
2025-05-07T20:33:46.7642733Z         x1 = x[:, D:]
2025-05-07T20:33:46.7642933Z 
2025-05-07T20:33:46.7643113Z         if contiguous:
2025-05-07T20:33:46.7643338Z             x0 = x0.contiguous()
2025-05-07T20:33:46.7643599Z             x1 = x1.contiguous()
2025-05-07T20:33:46.7643842Z 
2025-05-07T20:33:46.7644028Z         if scale_ub is not None:
2025-05-07T20:33:46.7644295Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:46.7644629Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:46.7644936Z             )
2025-05-07T20:33:46.7645129Z         else:
2025-05-07T20:33:46.7645337Z             scale_ub_tensor = None
2025-05-07T20:33:46.7645584Z 
2025-05-07T20:33:46.7645815Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:46.7646128Z             op = silu_mul_quant
2025-05-07T20:33:46.7646379Z             if compiled:
2025-05-07T20:33:46.7646622Z                 op = torch.compile(op)
2025-05-07T20:33:46.7646921Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:46.7647194Z 
2025-05-07T20:33:46.7647376Z         y_fp8, y_scale = fn()
2025-05-07T20:33:46.7647660Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:46.7647957Z 
2025-05-07T20:33:46.7648187Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:46.7648536Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:46.7648877Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:46.7649185Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:46.7649547Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:46.7649861Z 
2025-05-07T20:33:46.7650053Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:46.7650329Z 
2025-05-07T20:33:46.7650427Z moe/activation_test.py:126: 
2025-05-07T20:33:46.7650727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:46.7651069Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:46.7651453Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:46.7652278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:46.7653069Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:46.7653699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:46.7654416Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:46.7655246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:46.7656004Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:46.7656764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:46.7657435Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:46.7658069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:46.7658653Z     fn()
2025-05-07T20:33:46.7659227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:46.7659836Z     self.fn.run(
2025-05-07T20:33:46.7660317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:46.7660865Z     kernel = self.compile(
2025-05-07T20:33:46.7661424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:46.7662110Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:46.7662516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:46.7662753Z 
2025-05-07T20:33:46.7662964Z self = 
2025-05-07T20:33:46.7664085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:46.7665510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d968b80>}
2025-05-07T20:33:46.7666914Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:46.7667999Z context = 
2025-05-07T20:33:46.7668297Z 
2025-05-07T20:33:46.7668464Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:46.7669002Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:46.7669482Z                            module_map=module_map)
2025-05-07T20:33:46.7669847Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:46.7670204Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:46.7670471Z E       ^
2025-05-07T20:33:46.7670945Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:46.7671415Z 
2025-05-07T20:33:46.7671850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
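In this example the error surfaces in the reference path instead: triton_quantize_fp8_row launches _kernel_quantize_fp8_row, which needs the same fp8e4nv lowering, so the reference implementation cannot compile on this GPU either. Any fallback for pre-SM-8.9 runners therefore has to avoid Triton on the reference side too. Below is a hedged sketch of a Triton-free row-wise FP8 reference quantizer in plain PyTorch; quantize_fp8_row_ref is a hypothetical helper whose scale convention is inferred from the test's dequantization (y_fp8.to(torch.float32) * y_scale[:, None]), not FBGEMM's triton_quantize_fp8_row API:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, optionally clamped to the scale upper bound.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        row_max = torch.clamp(row_max, min=1e-12)  # guard against all-zero rows
        scale = row_max / FP8_MAX  # dequantization scale, one per row
        y_scaled = torch.clamp(y / scale[:, None], min=-FP8_MAX, max=FP8_MAX)
        return y_scaled.to(torch.float8_e4m3fn), scale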
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.7671415Z 2025-05-07T20:33:46.7671850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.7672450Z 2025-05-07T20:33:46.7672554Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.7679494Z self=, 2025-05-07T20:33:46.7679927Z T=1, 2025-05-07T20:33:46.7680117Z D=5120, 2025-05-07T20:33:46.7680394Z scale_ub=1200.0, 2025-05-07T20:33:46.7680619Z contiguous=False, 2025-05-07T20:33:46.7680852Z compiled=True, 2025-05-07T20:33:46.7681059Z ) 2025-05-07T20:33:46.9226674Z self = 2025-05-07T20:33:46.9227665Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:46.9228068Z 2025-05-07T20:33:46.9228179Z @given( 2025-05-07T20:33:46.9228441Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.9228759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.9229063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.9229400Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.9229727Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.9230025Z ) 2025-05-07T20:33:46.9230377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.9230832Z def test_silu_mul_quant( 2025-05-07T20:33:46.9231089Z self, 2025-05-07T20:33:46.9231281Z T: int, 2025-05-07T20:33:46.9231483Z D: int, 2025-05-07T20:33:46.9231712Z scale_ub: Optional[float], 2025-05-07T20:33:46.9232073Z contiguous: bool, 2025-05-07T20:33:46.9232331Z compiled: bool, 2025-05-07T20:33:46.9232554Z ) -> None: 2025-05-07T20:33:46.9232768Z torch.manual_seed(2025) 2025-05-07T20:33:46.9233007Z 2025-05-07T20:33:46.9233291Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.9233638Z 2025-05-07T20:33:46.9233825Z x_sign = torch.sign(x) 2025-05-07T20:33:46.9234116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.9234428Z x = x_sign * x_clamp 2025-05-07T20:33:46.9234662Z x0 = x[:, :D] 2025-05-07T20:33:46.9234875Z x1 = x[:, D:] 2025-05-07T20:33:46.9235074Z 2025-05-07T20:33:46.9235257Z if contiguous: 2025-05-07T20:33:46.9235488Z x0 = x0.contiguous() 2025-05-07T20:33:46.9235748Z x1 = x1.contiguous() 2025-05-07T20:33:46.9235977Z 2025-05-07T20:33:46.9236169Z if scale_ub is not None: 2025-05-07T20:33:46.9236444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.9236779Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.9237091Z ) 2025-05-07T20:33:46.9237280Z else: 2025-05-07T20:33:46.9237486Z scale_ub_tensor = None 2025-05-07T20:33:46.9237735Z 2025-05-07T20:33:46.9237968Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.9238276Z op = silu_mul_quant 2025-05-07T20:33:46.9238531Z if compiled: 2025-05-07T20:33:46.9238780Z op = torch.compile(op) 2025-05-07T20:33:46.9239068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.9239346Z 2025-05-07T20:33:46.9239540Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.9239703Z 2025-05-07T20:33:46.9239815Z moe/activation_test.py:117: 2025-05-07T20:33:46.9240105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.9240443Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.9240728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.9241303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:46.9241889Z return fn(*args, **kwargs) 
2025-05-07T20:33:46.9242567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.9243286Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.9243903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.9244610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.9245362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.9245922Z kernel = self.compile( 2025-05-07T20:33:46.9246492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.9247220Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.9247633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.9247872Z 2025-05-07T20:33:46.9248083Z self = 2025-05-07T20:33:46.9249210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.9250648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d969e40>} 2025-05-07T20:33:46.9252101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.9253185Z context = 2025-05-07T20:33:46.9253499Z 2025-05-07T20:33:46.9253668Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.9254207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.9254809Z module_map=module_map) 2025-05-07T20:33:46.9255183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.9255544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.9255813Z E ^ 2025-05-07T20:33:46.9256285Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.9256761Z 2025-05-07T20:33:46.9257196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.9257739Z 2025-05-07T20:33:46.9257843Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.9258255Z self=, 2025-05-07T20:33:46.9258661Z T=1, 2025-05-07T20:33:46.9258848Z D=5120, 2025-05-07T20:33:46.9259043Z scale_ub=1200.0, 2025-05-07T20:33:46.9259259Z contiguous=False, 2025-05-07T20:33:46.9259485Z compiled=False, 2025-05-07T20:33:46.9259690Z ) 2025-05-07T20:33:46.9260003Z self = 2025-05-07T20:33:46.9260505Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:46.9260785Z 2025-05-07T20:33:46.9260860Z @given( 2025-05-07T20:33:46.9261087Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.9261399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.9261705Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.9262034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.9262363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.9262649Z ) 2025-05-07T20:33:46.9262996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.9263446Z def test_silu_mul_quant( 2025-05-07T20:33:46.9263672Z self, 2025-05-07T20:33:46.9263862Z T: int, 2025-05-07T20:33:46.9264061Z D: int, 2025-05-07T20:33:46.9264323Z scale_ub: Optional[float], 2025-05-07T20:33:46.9264600Z contiguous: bool, 2025-05-07T20:33:46.9264835Z compiled: bool, 2025-05-07T20:33:46.9265044Z ) -> None: 2025-05-07T20:33:46.9265258Z torch.manual_seed(2025) 2025-05-07T20:33:46.9265494Z 2025-05-07T20:33:46.9265802Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.9266157Z 2025-05-07T20:33:46.9266352Z x_sign = torch.sign(x) 2025-05-07T20:33:46.9266635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.9266988Z x = x_sign * x_clamp 2025-05-07T20:33:46.9267219Z x0 = x[:, :D] 2025-05-07T20:33:46.9267424Z x1 = x[:, D:] 2025-05-07T20:33:46.9267623Z 2025-05-07T20:33:46.9267801Z if contiguous: 2025-05-07T20:33:46.9268023Z x0 = x0.contiguous() 2025-05-07T20:33:46.9268273Z x1 = x1.contiguous() 2025-05-07T20:33:46.9268521Z 2025-05-07T20:33:46.9268702Z if scale_ub is not None: 2025-05-07T20:33:46.9268972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.9269299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.9269613Z ) 2025-05-07T20:33:46.9269797Z else: 2025-05-07T20:33:46.9270005Z scale_ub_tensor = None 2025-05-07T20:33:46.9270253Z 2025-05-07T20:33:46.9270478Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.9270798Z op = silu_mul_quant 2025-05-07T20:33:46.9271095Z if compiled: 2025-05-07T20:33:46.9271336Z op = torch.compile(op) 2025-05-07T20:33:46.9271643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.9271925Z 2025-05-07T20:33:46.9272111Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.9272282Z 2025-05-07T20:33:46.9272378Z moe/activation_test.py:117: 2025-05-07T20:33:46.9272679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.9273023Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.9273303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.9274013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.9274736Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.9275286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.9276003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.9276696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.9277253Z kernel = self.compile( 2025-05-07T20:33:46.9277806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.9278488Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.9278946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.9279180Z 2025-05-07T20:33:46.9279390Z self = 2025-05-07T20:33:46.9280506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.9281926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d96aac0>} 2025-05-07T20:33:46.9283330Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.9284413Z context = 2025-05-07T20:33:46.9284759Z 2025-05-07T20:33:46.9284924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.9285453Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.9285974Z module_map=module_map) 2025-05-07T20:33:46.9286344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.9286699Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.9286955Z E ^ 2025-05-07T20:33:46.9287474Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.9287944Z 2025-05-07T20:33:46.9288379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.9288977Z 2025-05-07T20:33:46.9289077Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.9289504Z self=, 2025-05-07T20:33:46.9289917Z T=16384, 2025-05-07T20:33:46.9290102Z D=5120, 2025-05-07T20:33:46.9290299Z scale_ub=1200.0, 2025-05-07T20:33:46.9290523Z contiguous=False, 2025-05-07T20:33:46.9290745Z compiled=True, 2025-05-07T20:33:46.9290948Z ) 2025-05-07T20:33:47.0168887Z self = 2025-05-07T20:33:47.0169841Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:47.0170268Z 2025-05-07T20:33:47.0170394Z @given( 2025-05-07T20:33:47.0170706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.0171147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.0171514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.0171849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.0172189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.0172493Z ) 2025-05-07T20:33:47.0172845Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.0173297Z def test_silu_mul_quant( 2025-05-07T20:33:47.0173536Z self, 2025-05-07T20:33:47.0173728Z T: int, 2025-05-07T20:33:47.0173914Z D: int, 2025-05-07T20:33:47.0174132Z scale_ub: Optional[float], 2025-05-07T20:33:47.0174410Z contiguous: bool, 2025-05-07T20:33:47.0174744Z compiled: bool, 2025-05-07T20:33:47.0174971Z ) -> None: 2025-05-07T20:33:47.0175182Z torch.manual_seed(2025) 2025-05-07T20:33:47.0175420Z 2025-05-07T20:33:47.0175689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.0176031Z 2025-05-07T20:33:47.0176216Z x_sign = torch.sign(x) 2025-05-07T20:33:47.0176505Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.0176821Z x = x_sign * x_clamp 2025-05-07T20:33:47.0177056Z x0 = x[:, :D] 2025-05-07T20:33:47.0177264Z x1 = x[:, D:] 2025-05-07T20:33:47.0177466Z 2025-05-07T20:33:47.0177651Z if contiguous: 2025-05-07T20:33:47.0177875Z x0 = x0.contiguous() 2025-05-07T20:33:47.0178131Z x1 = x1.contiguous() 2025-05-07T20:33:47.0178371Z 2025-05-07T20:33:47.0178557Z if scale_ub is not None: 2025-05-07T20:33:47.0178829Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.0179164Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.0179470Z ) 2025-05-07T20:33:47.0179656Z else: 2025-05-07T20:33:47.0179864Z scale_ub_tensor = None 2025-05-07T20:33:47.0180111Z 2025-05-07T20:33:47.0180338Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.0180650Z op = silu_mul_quant 2025-05-07T20:33:47.0180894Z if compiled: 2025-05-07T20:33:47.0181137Z op = torch.compile(op) 2025-05-07T20:33:47.0181435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.0181802Z 2025-05-07T20:33:47.0181985Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.0182154Z 2025-05-07T20:33:47.0182249Z moe/activation_test.py:117: 2025-05-07T20:33:47.0182544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.0182933Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.0183216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.0183796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:47.0184438Z return fn(*args, **kwargs) 
2025-05-07T20:33:47.0185123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.0185852Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.0186406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.0187119Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.0187815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.0188368Z kernel = self.compile( 2025-05-07T20:33:47.0188931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.0189663Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.0190073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.0190316Z 2025-05-07T20:33:47.0190532Z self = 2025-05-07T20:33:47.0191646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.0193082Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7c180>} 2025-05-07T20:33:47.0194499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.0195587Z context = 2025-05-07T20:33:47.0195890Z 2025-05-07T20:33:47.0196064Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.0196599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.0197088Z module_map=module_map) 2025-05-07T20:33:47.0197464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.0197822Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.0198086Z E ^ 2025-05-07T20:33:47.0198565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.0199037Z 2025-05-07T20:33:47.0199481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.0200022Z 2025-05-07T20:33:47.0200127Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.0200650Z self=, 2025-05-07T20:33:47.0201080Z T=2048, 2025-05-07T20:33:47.0201258Z D=7168, 2025-05-07T20:33:47.0201468Z scale_ub=1200.0, 2025-05-07T20:33:47.0201690Z contiguous=False, 2025-05-07T20:33:47.0201918Z compiled=True, 2025-05-07T20:33:47.0202113Z ) 2025-05-07T20:33:47.0202429Z self = 2025-05-07T20:33:47.0202989Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:47.0203272Z 2025-05-07T20:33:47.0203354Z @given( 2025-05-07T20:33:47.0203572Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.0203886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.0204301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.0204628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.0204957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.0205248Z ) 2025-05-07T20:33:47.0205633Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.0206092Z def test_silu_mul_quant( 2025-05-07T20:33:47.0206329Z self, 2025-05-07T20:33:47.0206518Z T: int, 2025-05-07T20:33:47.0206707Z D: int, 2025-05-07T20:33:47.0206923Z scale_ub: Optional[float], 2025-05-07T20:33:47.0207195Z contiguous: bool, 2025-05-07T20:33:47.0207437Z compiled: bool, 2025-05-07T20:33:47.0207658Z ) -> None: 2025-05-07T20:33:47.0207869Z torch.manual_seed(2025) 2025-05-07T20:33:47.0208101Z 2025-05-07T20:33:47.0208375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.0208727Z 2025-05-07T20:33:47.0208916Z x_sign = torch.sign(x) 2025-05-07T20:33:47.0209207Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.0209518Z x = x_sign * x_clamp 2025-05-07T20:33:47.0209798Z x0 = x[:, :D] 2025-05-07T20:33:47.0210016Z x1 = x[:, D:] 2025-05-07T20:33:47.0210229Z 2025-05-07T20:33:47.0210410Z if contiguous: 2025-05-07T20:33:47.0210643Z x0 = x0.contiguous() 2025-05-07T20:33:47.0210900Z x1 = x1.contiguous() 2025-05-07T20:33:47.0211137Z 2025-05-07T20:33:47.0211323Z if scale_ub is not None: 2025-05-07T20:33:47.0211598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.0211929Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.0212226Z ) 2025-05-07T20:33:47.0212413Z else: 2025-05-07T20:33:47.0212620Z scale_ub_tensor = None 2025-05-07T20:33:47.0212863Z 2025-05-07T20:33:47.0213090Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.0213418Z op = silu_mul_quant 2025-05-07T20:33:47.0213657Z if compiled: 2025-05-07T20:33:47.0213899Z op = torch.compile(op) 2025-05-07T20:33:47.0214204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.0214479Z 2025-05-07T20:33:47.0214797Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.0214957Z 2025-05-07T20:33:47.0215058Z moe/activation_test.py:117: 2025-05-07T20:33:47.0215346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.0215681Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.0215966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.0216544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:47.0217126Z return fn(*args, **kwargs) 
2025-05-07T20:33:47.0217819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.0218546Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.0219105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.0219823Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.0220524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.0221085Z kernel = self.compile( 2025-05-07T20:33:47.0221645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.0222393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.0222801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.0223034Z 2025-05-07T20:33:47.0223251Z self = 2025-05-07T20:33:47.0224410Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.0226126Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7cea0>} 2025-05-07T20:33:47.0227545Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.0228641Z context = 2025-05-07T20:33:47.0228949Z 2025-05-07T20:33:47.0229126Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.0229675Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.0230161Z module_map=module_map) 2025-05-07T20:33:47.0230627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.0230993Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.0231271Z E ^ 2025-05-07T20:33:47.0231757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.0232230Z 2025-05-07T20:33:47.0232673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.0233217Z 2025-05-07T20:33:47.1387701Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.1388345Z self=, 2025-05-07T20:33:47.1389064Z T=1, 2025-05-07T20:33:47.1389335Z D=5120, 2025-05-07T20:33:47.1389616Z scale_ub=None, 2025-05-07T20:33:47.1389914Z contiguous=False, 2025-05-07T20:33:47.1390156Z compiled=False, 2025-05-07T20:33:47.1390365Z ) 2025-05-07T20:33:47.1390688Z self = 2025-05-07T20:33:47.1391198Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:47.1391477Z 2025-05-07T20:33:47.1391556Z @given( 2025-05-07T20:33:47.1391788Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.1392106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.1392418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.1392758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.1393094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.1393382Z ) 2025-05-07T20:33:47.1393730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.1394183Z def test_silu_mul_quant( 2025-05-07T20:33:47.1394424Z self, 2025-05-07T20:33:47.1394626Z T: int, 2025-05-07T20:33:47.1394818Z D: int, 2025-05-07T20:33:47.1395038Z scale_ub: Optional[float], 2025-05-07T20:33:47.1395310Z contiguous: bool, 2025-05-07T20:33:47.1395547Z compiled: bool, 2025-05-07T20:33:47.1395774Z ) -> None: 2025-05-07T20:33:47.1395996Z torch.manual_seed(2025) 2025-05-07T20:33:47.1396253Z 2025-05-07T20:33:47.1396534Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.1396896Z 2025-05-07T20:33:47.1397100Z x_sign = torch.sign(x) 2025-05-07T20:33:47.1397396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.1397833Z x = x_sign * x_clamp 2025-05-07T20:33:47.1398091Z x0 = x[:, :D] 2025-05-07T20:33:47.1398314Z x1 = x[:, D:] 2025-05-07T20:33:47.1398541Z 2025-05-07T20:33:47.1398744Z if contiguous: 2025-05-07T20:33:47.1399020Z x0 = x0.contiguous() 2025-05-07T20:33:47.1399288Z x1 = x1.contiguous() 2025-05-07T20:33:47.1399606Z 2025-05-07T20:33:47.1399797Z if scale_ub is not None: 2025-05-07T20:33:47.1400082Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.1400435Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.1400850Z ) 2025-05-07T20:33:47.1401058Z else: 2025-05-07T20:33:47.1401284Z scale_ub_tensor = None 2025-05-07T20:33:47.1401567Z 2025-05-07T20:33:47.1408656Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.1408996Z op = silu_mul_quant 2025-05-07T20:33:47.1409290Z if compiled: 2025-05-07T20:33:47.1409556Z op = torch.compile(op) 2025-05-07T20:33:47.1409863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.1410141Z 2025-05-07T20:33:47.1410335Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.1410502Z 2025-05-07T20:33:47.1410606Z moe/activation_test.py:117: 2025-05-07T20:33:47.1410906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.1411245Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.1411533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.1412359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.1413091Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.1413661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.1414380Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.1415228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.1415786Z kernel = self.compile( 2025-05-07T20:33:47.1416356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.1417051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.1417475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.1417718Z 2025-05-07T20:33:47.1417937Z self = 2025-05-07T20:33:47.1419067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.1420503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7de40>} 2025-05-07T20:33:47.1421925Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.1423010Z context = 2025-05-07T20:33:47.1423319Z 2025-05-07T20:33:47.1423495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.1424048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.1424538Z module_map=module_map) 2025-05-07T20:33:47.1424921Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.1425292Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.1425814Z E ^ 2025-05-07T20:33:47.1426454Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.1427014Z 2025-05-07T20:33:47.1427451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.1428001Z 2025-05-07T20:33:47.1428189Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.1428611Z self=, 2025-05-07T20:33:47.1429027Z T=4096, 2025-05-07T20:33:47.1429219Z D=7168, 2025-05-07T20:33:47.1429476Z scale_ub=1200.0, 2025-05-07T20:33:47.1429698Z contiguous=False, 2025-05-07T20:33:47.1429927Z compiled=False, 2025-05-07T20:33:47.1430130Z ) 2025-05-07T20:33:47.1430446Z self = 2025-05-07T20:33:47.1430959Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:47.1431246Z 2025-05-07T20:33:47.1431328Z @given( 2025-05-07T20:33:47.1431554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.1431877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.1432186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.1432517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.1432849Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.1433143Z ) 2025-05-07T20:33:47.1433558Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.1434014Z def test_silu_mul_quant( 2025-05-07T20:33:47.1434257Z self, 2025-05-07T20:33:47.1434452Z T: int, 2025-05-07T20:33:47.1434644Z D: int, 2025-05-07T20:33:47.1434864Z scale_ub: Optional[float], 2025-05-07T20:33:47.1435136Z contiguous: bool, 2025-05-07T20:33:47.1435369Z compiled: bool, 2025-05-07T20:33:47.1435593Z ) -> None: 2025-05-07T20:33:47.1435804Z torch.manual_seed(2025) 2025-05-07T20:33:47.1436041Z 2025-05-07T20:33:47.1436316Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.1436669Z 2025-05-07T20:33:47.1436863Z x_sign = torch.sign(x) 2025-05-07T20:33:47.1437151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.1437463Z x = x_sign * x_clamp 2025-05-07T20:33:47.1437704Z x0 = x[:, :D] 2025-05-07T20:33:47.1437907Z x1 = x[:, D:] 2025-05-07T20:33:47.1438111Z 2025-05-07T20:33:47.1438297Z if contiguous: 2025-05-07T20:33:47.1438524Z x0 = x0.contiguous() 2025-05-07T20:33:47.1438813Z x1 = x1.contiguous() 2025-05-07T20:33:47.1439060Z 2025-05-07T20:33:47.1439238Z if scale_ub is not None: 2025-05-07T20:33:47.1439508Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.1439841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.1440148Z ) 2025-05-07T20:33:47.1440337Z else: 2025-05-07T20:33:47.1440540Z scale_ub_tensor = None 2025-05-07T20:33:47.1440783Z 2025-05-07T20:33:47.1441004Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.1441318Z op = silu_mul_quant 2025-05-07T20:33:47.1441561Z if compiled: 2025-05-07T20:33:47.1441799Z op = torch.compile(op) 2025-05-07T20:33:47.1442092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.1442363Z 2025-05-07T20:33:47.1442547Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.1442714Z 2025-05-07T20:33:47.1442810Z moe/activation_test.py:117: 2025-05-07T20:33:47.1443103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.1443425Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.1443702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.1444410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:47.1445185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.1445735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.1446449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.1447182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.1447735Z kernel = self.compile( 2025-05-07T20:33:47.1448300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.1449024Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.1449433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.1449670Z 2025-05-07T20:33:47.1449881Z self = 2025-05-07T20:33:47.1451006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.1452429Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7f380>} 2025-05-07T20:33:47.1455075Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.1456169Z context = 2025-05-07T20:33:47.1456467Z 2025-05-07T20:33:47.1456636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.1457353Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.1457840Z module_map=module_map) 2025-05-07T20:33:47.1458213Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.1458576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.1458843Z E ^ 2025-05-07T20:33:47.1459323Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.1459795Z 2025-05-07T20:33:47.1460236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.1460784Z 2025-05-07T20:33:47.1460887Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.1461307Z self=, 2025-05-07T20:33:47.1461720Z T=16384, 2025-05-07T20:33:47.1461905Z D=7168, 2025-05-07T20:33:47.1462087Z scale_ub=None, 2025-05-07T20:33:47.1462293Z contiguous=True, 2025-05-07T20:33:47.1462510Z compiled=True, 2025-05-07T20:33:47.1462698Z ) 2025-05-07T20:33:47.3210892Z self = 2025-05-07T20:33:47.3211662Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:47.3212070Z 2025-05-07T20:33:47.3212192Z @given( 2025-05-07T20:33:47.3212509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.3212916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.3213240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.3213586Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.3213916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.3214211Z ) 2025-05-07T20:33:47.3214715Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.3215180Z def test_silu_mul_quant( 2025-05-07T20:33:47.3215434Z self, 2025-05-07T20:33:47.3215763Z T: int, 2025-05-07T20:33:47.3215957Z D: int, 2025-05-07T20:33:47.3216176Z scale_ub: Optional[float], 2025-05-07T20:33:47.3216457Z contiguous: bool, 2025-05-07T20:33:47.3216700Z compiled: bool, 2025-05-07T20:33:47.3216921Z ) -> None: 2025-05-07T20:33:47.3217203Z torch.manual_seed(2025) 2025-05-07T20:33:47.3217461Z 2025-05-07T20:33:47.3217735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.3218090Z 2025-05-07T20:33:47.3218284Z x_sign = torch.sign(x) 2025-05-07T20:33:47.3218636Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.3218957Z x = x_sign * x_clamp 2025-05-07T20:33:47.3219201Z x0 = x[:, :D] 2025-05-07T20:33:47.3219418Z x1 = x[:, D:] 2025-05-07T20:33:47.3219629Z 2025-05-07T20:33:47.3219816Z if contiguous: 2025-05-07T20:33:47.3220044Z x0 = x0.contiguous() 2025-05-07T20:33:47.3220308Z x1 = x1.contiguous() 2025-05-07T20:33:47.3220561Z 2025-05-07T20:33:47.3220754Z if scale_ub is not None: 2025-05-07T20:33:47.3221026Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.3221363Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.3221680Z ) 2025-05-07T20:33:47.3221872Z else: 2025-05-07T20:33:47.3222095Z scale_ub_tensor = None 2025-05-07T20:33:47.3222354Z 2025-05-07T20:33:47.3222647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.3222974Z op = silu_mul_quant 2025-05-07T20:33:47.3223237Z if compiled: 2025-05-07T20:33:47.3223487Z op = torch.compile(op) 2025-05-07T20:33:47.3223798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.3224084Z 2025-05-07T20:33:47.3224285Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.3224452Z 2025-05-07T20:33:47.3224561Z moe/activation_test.py:117: 2025-05-07T20:33:47.3224871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.3225216Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.3225760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.3226350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:47.3226945Z return fn(*args, **kwargs) 
2025-05-07T20:33:47.3227639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.3228370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.3228938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.3229649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.3230343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.3230908Z kernel = self.compile( 2025-05-07T20:33:47.3231474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.3232165Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.3232581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.3232825Z 2025-05-07T20:33:47.3233043Z self = 2025-05-07T20:33:47.3234165Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.3235603Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d6b44a0>} 2025-05-07T20:33:47.3237096Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.3238208Z context = 2025-05-07T20:33:47.3238599Z 2025-05-07T20:33:47.3238774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.3239323Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.3239867Z module_map=module_map) 2025-05-07T20:33:47.3240243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.3240608Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.3240874Z E ^ 2025-05-07T20:33:47.3241357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.3241839Z 2025-05-07T20:33:47.3242278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.3242822Z 2025-05-07T20:33:47.3242932Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.3243356Z self=, 2025-05-07T20:33:47.3243777Z T=4096, 2025-05-07T20:33:47.3243971Z D=5120, 2025-05-07T20:33:47.3244156Z scale_ub=None, 2025-05-07T20:33:47.3244437Z contiguous=False, 2025-05-07T20:33:47.3244665Z compiled=True, 2025-05-07T20:33:47.3244870Z ) 2025-05-07T20:33:47.3245185Z self = 2025-05-07T20:33:47.3245691Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:47.3245971Z 2025-05-07T20:33:47.3246056Z @given( 2025-05-07T20:33:47.3246280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.3246597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.3246908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.3247238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.3247570Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.3247859Z ) 2025-05-07T20:33:47.3248213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.3248665Z def test_silu_mul_quant( 2025-05-07T20:33:47.3248907Z self, 2025-05-07T20:33:47.3249118Z T: int, 2025-05-07T20:33:47.3249352Z D: int, 2025-05-07T20:33:47.3249576Z scale_ub: Optional[float], 2025-05-07T20:33:47.3249849Z contiguous: bool, 2025-05-07T20:33:47.3250088Z compiled: bool, 2025-05-07T20:33:47.3250317Z ) -> None: 2025-05-07T20:33:47.3250541Z torch.manual_seed(2025) 2025-05-07T20:33:47.3250777Z 2025-05-07T20:33:47.3251055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.3251413Z 2025-05-07T20:33:47.3251603Z x_sign = torch.sign(x) 2025-05-07T20:33:47.3251893Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.3252206Z x = x_sign * x_clamp 2025-05-07T20:33:47.3252440Z x0 = x[:, :D] 2025-05-07T20:33:47.3252657Z x1 = x[:, D:] 2025-05-07T20:33:47.3252864Z 2025-05-07T20:33:47.3253045Z if contiguous: 2025-05-07T20:33:47.3253279Z x0 = x0.contiguous() 2025-05-07T20:33:47.3253539Z x1 = x1.contiguous() 2025-05-07T20:33:47.3253785Z 2025-05-07T20:33:47.3253973Z if scale_ub is not None: 2025-05-07T20:33:47.3254251Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.3254709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.3255016Z ) 2025-05-07T20:33:47.3255207Z else: 2025-05-07T20:33:47.3255414Z scale_ub_tensor = None 2025-05-07T20:33:47.3255657Z 2025-05-07T20:33:47.3255941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.3256267Z op = silu_mul_quant 2025-05-07T20:33:47.3256519Z if compiled: 2025-05-07T20:33:47.3256763Z op = torch.compile(op) 2025-05-07T20:33:47.3257062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.3257372Z 2025-05-07T20:33:47.3257570Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.3257732Z 2025-05-07T20:33:47.3257833Z moe/activation_test.py:117: 2025-05-07T20:33:47.3258131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.3258506Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.3258792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.3259366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:47.3259946Z return fn(*args, **kwargs) 
2025-05-07T20:33:47.3260631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.3261354Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.3261909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.3262621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.3263358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.3263922Z kernel = self.compile( 2025-05-07T20:33:47.3264490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.3265183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.3265597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.3265838Z 2025-05-07T20:33:47.3266053Z self = 2025-05-07T20:33:47.3267188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.3268619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d6b51c0>} 2025-05-07T20:33:47.3270081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.3271168Z context = 2025-05-07T20:33:47.3271469Z 2025-05-07T20:33:47.3271644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.3272180Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.3272665Z module_map=module_map) 2025-05-07T20:33:47.3273038Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.3273395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.3273663Z E ^ 2025-05-07T20:33:47.3274144Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then tried eleven more examples, and every one failed with the identical CompilationError raised from _fbgemm_silu_mul_quant[grid] at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (the compiled=True runs additionally enter through torch/_dynamo/eval_frame.py:678 in _fn before reaching the kernel; the full traceback is as above):

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
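For context on what these examples exercise: from the test body above, silu_mul_quant consumes the two [T, D] halves of a [T, 2*D] bf16 activation and returns an FP8 tensor together with per-row scales. A rough eager-mode sketch of that contract, assuming rowwise absmax scaling into float8_e4m3fn (the PyTorch spelling of fp8e4nv) and scale_ub acting as an upper bound on the row maximum; this is our reading of the test, not FBGEMM's documented implementation:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,                         # [T, D] bf16, SiLU branch
        x1: torch.Tensor,                         # [T, D] bf16, gate branch
        scale_ub: Optional[torch.Tensor] = None,  # [1] fp32 cap, like scale_ub_tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Fused activation: y = silu(x0) * x1, computed in fp32 for stability.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise absmax, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Map each row into the representable fp8e4nv (float8_e4m3fn) range.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = row_max / fp8_max
        y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

In eager PyTorch the final cast is a plain elementwise conversion, which is why a reference like this can run on GPUs where the fused Triton kernel cannot: it is the kernel's fp8e4nv code path that the compiler rejects here.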
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:48.2872300Z 2025-05-07T20:33:48.2872736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:48.2873332Z 2025-05-07T20:33:48.4570482Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:48.4581773Z self=, 2025-05-07T20:33:48.4582207Z T=16384, 2025-05-07T20:33:48.4582522Z D=5120, 2025-05-07T20:33:48.4582723Z scale_ub=None, 2025-05-07T20:33:48.4582945Z contiguous=False, 2025-05-07T20:33:48.4583178Z compiled=True, 2025-05-07T20:33:48.4583386Z ) 2025-05-07T20:33:48.4583715Z self = 2025-05-07T20:33:48.4584282Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:48.4584572Z 2025-05-07T20:33:48.4584646Z @given( 2025-05-07T20:33:48.4584867Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:48.4585175Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:48.4585484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:48.4585817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:48.4586141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:48.4586423Z ) 2025-05-07T20:33:48.4586770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:48.4587225Z def test_silu_mul_quant( 2025-05-07T20:33:48.4587458Z self, 2025-05-07T20:33:48.4587648Z T: int, 2025-05-07T20:33:48.4587838Z D: int, 2025-05-07T20:33:48.4588110Z scale_ub: Optional[float], 2025-05-07T20:33:48.4588391Z contiguous: bool, 2025-05-07T20:33:48.4588630Z compiled: bool, 2025-05-07T20:33:48.4588850Z ) -> None: 2025-05-07T20:33:48.4589069Z torch.manual_seed(2025) 2025-05-07T20:33:48.4589315Z 2025-05-07T20:33:48.4589583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:48.4589927Z 2025-05-07T20:33:48.4590117Z x_sign = torch.sign(x) 2025-05-07T20:33:48.4590407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:48.4590719Z x = x_sign * x_clamp 2025-05-07T20:33:48.4590957Z x0 = x[:, :D] 2025-05-07T20:33:48.4591169Z x1 = x[:, D:] 2025-05-07T20:33:48.4591368Z 2025-05-07T20:33:48.4591549Z if contiguous: 2025-05-07T20:33:48.4591777Z x0 = x0.contiguous() 2025-05-07T20:33:48.4592030Z x1 = x1.contiguous() 2025-05-07T20:33:48.4592262Z 2025-05-07T20:33:48.4592450Z if scale_ub is not None: 2025-05-07T20:33:48.4592715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:48.4593058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:48.4593372Z ) 2025-05-07T20:33:48.4593554Z else: 2025-05-07T20:33:48.4593762Z scale_ub_tensor = None 2025-05-07T20:33:48.4594020Z 2025-05-07T20:33:48.4594247Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:48.4594567Z op = silu_mul_quant 2025-05-07T20:33:48.4594816Z if compiled: 2025-05-07T20:33:48.4595056Z op = torch.compile(op) 2025-05-07T20:33:48.4595359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:48.4595646Z 2025-05-07T20:33:48.4595841Z > y_fp8, y_scale = fn() 2025-05-07T20:33:48.4596008Z 2025-05-07T20:33:48.4596110Z moe/activation_test.py:117: 2025-05-07T20:33:48.4596407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:48.4596753Z moe/activation_test.py:115: in fn 2025-05-07T20:33:48.4597038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:48.4597616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:48.4598196Z return fn(*args, **kwargs) 
2025-05-07T20:33:48.4598866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:48.4599711Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:48.4600268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:48.4600977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:48.4601702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:48.4602261Z kernel = self.compile( 2025-05-07T20:33:48.4602826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:48.4603549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:48.4603951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:48.4604189Z 2025-05-07T20:33:48.4604402Z self = 2025-05-07T20:33:48.4605530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:48.4606965Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d1a0c20>} 2025-05-07T20:33:48.4608410Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:48.4609549Z context = 2025-05-07T20:33:48.4609856Z 2025-05-07T20:33:48.4610022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:48.4610556Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:48.4611033Z module_map=module_map) 2025-05-07T20:33:48.4611411Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:48.4611777Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:48.4612046Z E ^ 2025-05-07T20:33:48.4612524Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:48.4613438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Hypothesis went on to try ten more examples; each failed with this same fp8e4nv CompilationError from _fbgemm_silu_mul_quant:
2025-05-07T20:33:48.4614090Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:48.5544754Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:48.7315496Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:48.7349318Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:49.0337352Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:49.1591347Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:49.1626846Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:49.3383735Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:49.3423321Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:49.4367220Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
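Every CompilationError above bottoms out in the same place: Triton refuses to lower _fbgemm_silu_mul_quant because the destination dtype fp8e4nv (float8_e4m3fn) is not implemented for this GPU's architecture, which only exposes 'fp8e4b15' and 'fp8e5'. This is why the failure is identical with compiled=True and compiled=False: both paths launch the same Triton kernel. A minimal sketch of a compute-capability guard that would skip these cases on unsupported hardware follows; the helper name, the test class, and the >= (8, 9) (Ada/Hopper) threshold are assumptions for illustration, not code from activation_test.py:

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8_e4m3fn) requires NVIDIA compute
        # capability >= 8.9; older GPUs only expose the 'fp8e4b15' and
        # 'fp8e5' dtypes named in the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class Fp8GuardExample(unittest.TestCase):  # hypothetical test class
        @unittest.skipIf(
            not gpu_supports_fp8e4nv(),
            "fp8e4nv not supported on this GPU architecture",
        )
        def test_fp8_path(self) -> None:
            # The real test body would call silu_mul_quant here.
            pass

    if __name__ == "__main__":
        unittest.main()

With such a guard (or an equivalent pytest.mark.skipif), the examples above would report as skipped instead of producing one traceback per Hypothesis example for a known-unsupported dtype.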
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:49.4398934Z 2025-05-07T20:33:49.4399383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
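The CompilationError above is an architecture mismatch, not a bug in the test inputs: Triton's fp8e4nv element type (FP8 E4M3) is only emitted for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), while the A10G on this g5 runner reports 8.6, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability guard, assuming a unittest-style test class (the helper name, class name, and message are illustrative, not from the test file):

    import unittest

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (FP8 E4M3) requires compute capability >= 8.9; an A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv needs compute capability 8.9+")
    class SiluMulQuantCudaTest(unittest.TestCase):
        pass

With such a guard the fp8 cases would be reported as skipped on this runner instead of failing at Triton compile time.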
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free.
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB with 140.44 MiB free.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 28.44 MiB free.
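The OutOfMemoryError entries are a knock-on effect of the earlier failures: the card (22.07 GiB total) stays nearly full across examples, so later ones fail on allocations as small as 40 MiB. Two mitigations, sketched under the assumption that they are applied at process start and between Hypothesis examples respectively (the helper name is illustrative); the first is the one the error text itself suggests:

    import gc
    import os

    # Must be set before the process makes its first CUDA allocation,
    # hence before importing code that initializes CUDA.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached blocks to the driver.
        gc.collect()
        torch.cuda.empty_cache()

Calling release_cuda_memory() at the end of each example keeps one example's tensors from starving the next; expandable_segments additionally reduces fragmentation of what remains allocated.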
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at moe/activation_test.py:117 (y_fp8, y_scale = fn()): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117.
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117.
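For reading the failures it helps to know what the op under test computes: silu_mul_quant fuses SiLU(x0) * x1 with quantization to FP8. A plain-PyTorch sketch under the assumption of rowwise e4m3 scaling with an optional scale upper bound (the function name, scaling rule, and constant are illustrative, not FBGEMM's kernel):

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU on the first half, elementwise product with the second half.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # One scale per row, optionally clamped from above.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        y_scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)

The fused Triton kernel cannot even compile on this GPU because that final e4m3 (fp8e4nv) conversion has no hardware mapping below compute capability 8.9.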
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB with 26.44 MiB free.
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117.
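Each of these parameter sets comes from Hypothesis's verbose "Trying example" output, which makes them easy to pin for a deterministic repro once triaged. A self-contained sketch of the pattern (the strategies mirror the test's, the body is a stand-in):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=2048, D=7168)  # replayed first on every run, before random draws
    @settings(deadline=None)
    def test_pinned_case(T: int, D: int) -> None:
        assert T * D > 0

With @example, the known-bad combination is exercised on every invocation even after the Hypothesis example database is cleared.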
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:49.7833819Z 2025-05-07T20:33:49.7834263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:49.7834814Z 2025-05-07T20:33:49.7834924Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.7835355Z self=, 2025-05-07T20:33:49.7835778Z T=2048, 2025-05-07T20:33:49.7835963Z D=5120, 2025-05-07T20:33:49.7836154Z scale_ub=None, 2025-05-07T20:33:49.7836375Z contiguous=True, 2025-05-07T20:33:49.7836596Z compiled=False, 2025-05-07T20:33:49.7836804Z ) 2025-05-07T20:33:49.7837129Z self = 2025-05-07T20:33:49.7837642Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.7837928Z 2025-05-07T20:33:49.7838008Z @given( 2025-05-07T20:33:49.7838307Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.7838630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.7838937Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.7839273Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.7839660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.7839946Z ) 2025-05-07T20:33:49.7840305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.7840768Z def test_silu_mul_quant( 2025-05-07T20:33:49.7841008Z self, 2025-05-07T20:33:49.7841199Z T: int, 2025-05-07T20:33:49.7841399Z D: int, 2025-05-07T20:33:49.7841614Z scale_ub: Optional[float], 2025-05-07T20:33:49.7841891Z contiguous: bool, 2025-05-07T20:33:49.7842136Z compiled: bool, 2025-05-07T20:33:49.7842351Z ) -> None: 2025-05-07T20:33:49.7842570Z torch.manual_seed(2025) 2025-05-07T20:33:49.7842811Z 2025-05-07T20:33:49.7843084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.7843441Z 2025-05-07T20:33:49.7843638Z > x_sign = torch.sign(x) 2025-05-07T20:33:49.7845721Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.7847781Z 2025-05-07T20:33:49.7847913Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:49.7848134Z 2025-05-07T20:33:49.7848241Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.7848674Z self=, 2025-05-07T20:33:49.7849096Z T=16384, 2025-05-07T20:33:49.7849283Z D=5120, 2025-05-07T20:33:49.7849483Z scale_ub=None, 2025-05-07T20:33:49.7849707Z contiguous=True, 2025-05-07T20:33:49.7849945Z compiled=False, 2025-05-07T20:33:49.7850153Z ) 2025-05-07T20:33:49.8614927Z self = 2025-05-07T20:33:49.8615505Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.8615796Z 2025-05-07T20:33:49.8615887Z @given( 2025-05-07T20:33:49.8616113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8616438Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8623975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8624457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8624803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8625103Z ) 2025-05-07T20:33:49.8625637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8626168Z def test_silu_mul_quant( 2025-05-07T20:33:49.8626416Z self, 2025-05-07T20:33:49.8626614Z T: int, 2025-05-07T20:33:49.8626813Z D: int, 2025-05-07T20:33:49.8627037Z scale_ub: Optional[float], 2025-05-07T20:33:49.8627325Z contiguous: bool, 2025-05-07T20:33:49.8627565Z compiled: bool, 2025-05-07T20:33:49.8627794Z ) -> None: 2025-05-07T20:33:49.8628018Z torch.manual_seed(2025) 2025-05-07T20:33:49.8628258Z 2025-05-07T20:33:49.8628538Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8630863Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8632882Z 2025-05-07T20:33:49.8633006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8633226Z 2025-05-07T20:33:49.8633346Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8633776Z self=, 2025-05-07T20:33:49.8634209Z T=4096, 2025-05-07T20:33:49.8634411Z D=5120, 2025-05-07T20:33:49.8634608Z scale_ub=None, 2025-05-07T20:33:49.8634837Z contiguous=True, 2025-05-07T20:33:49.8635068Z compiled=False, 2025-05-07T20:33:49.8635281Z ) 2025-05-07T20:33:49.8635613Z self = 2025-05-07T20:33:49.8636135Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.8636422Z 2025-05-07T20:33:49.8636519Z @given( 2025-05-07T20:33:49.8636752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8637088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8637406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8637747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8638094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8638390Z ) 2025-05-07T20:33:49.8638746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8639210Z def test_silu_mul_quant( 2025-05-07T20:33:49.8639529Z self, 2025-05-07T20:33:49.8639729Z T: int, 2025-05-07T20:33:49.8639931Z D: int, 2025-05-07T20:33:49.8640155Z scale_ub: Optional[float], 2025-05-07T20:33:49.8640433Z contiguous: bool, 2025-05-07T20:33:49.8640686Z compiled: bool, 2025-05-07T20:33:49.8640919Z ) -> None: 2025-05-07T20:33:49.8641142Z torch.manual_seed(2025) 2025-05-07T20:33:49.8641385Z 2025-05-07T20:33:49.8641661Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8643906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8645900Z 2025-05-07T20:33:49.8646029Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8646247Z 2025-05-07T20:33:49.8646357Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8646784Z self=, 2025-05-07T20:33:49.8647250Z T=2048, 2025-05-07T20:33:49.8647442Z D=5120, 2025-05-07T20:33:49.8647622Z scale_ub=None, 2025-05-07T20:33:49.8647839Z contiguous=False, 2025-05-07T20:33:49.8648069Z compiled=False, 2025-05-07T20:33:49.8648265Z ) 2025-05-07T20:33:49.8648587Z self = 2025-05-07T20:33:49.8649087Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:49.8649373Z 2025-05-07T20:33:49.8649452Z @given( 2025-05-07T20:33:49.8649691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8650015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8650329Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8650671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8651057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8651359Z ) 2025-05-07T20:33:49.8651719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8652184Z def test_silu_mul_quant( 2025-05-07T20:33:49.8652437Z self, 2025-05-07T20:33:49.8652637Z T: int, 2025-05-07T20:33:49.8652846Z D: int, 2025-05-07T20:33:49.8653067Z scale_ub: Optional[float], 2025-05-07T20:33:49.8653342Z contiguous: bool, 2025-05-07T20:33:49.8653589Z compiled: bool, 2025-05-07T20:33:49.8653817Z ) -> None: 2025-05-07T20:33:49.8654028Z torch.manual_seed(2025) 2025-05-07T20:33:49.8654278Z 2025-05-07T20:33:49.8654672Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8656862Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8658858Z 2025-05-07T20:33:49.8658979Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8659195Z 2025-05-07T20:33:49.8659344Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8659787Z self=, 2025-05-07T20:33:49.8660201Z T=4096, 2025-05-07T20:33:49.8660440Z D=7168, 2025-05-07T20:33:49.8660626Z scale_ub=None, 2025-05-07T20:33:49.8660842Z contiguous=True, 2025-05-07T20:33:49.8661068Z compiled=True, 2025-05-07T20:33:49.8661264Z ) 2025-05-07T20:33:49.8661586Z self = 2025-05-07T20:33:49.8662096Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:49.8662376Z 2025-05-07T20:33:49.8662454Z @given( 2025-05-07T20:33:49.8662689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8663012Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8663331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8663665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8664004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8664303Z ) 2025-05-07T20:33:49.8664658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8665126Z def test_silu_mul_quant( 2025-05-07T20:33:49.8665384Z self, 2025-05-07T20:33:49.8665629Z T: int, 2025-05-07T20:33:49.8665838Z D: int, 2025-05-07T20:33:49.8666064Z scale_ub: Optional[float], 2025-05-07T20:33:49.8666347Z contiguous: bool, 2025-05-07T20:33:49.8666597Z compiled: bool, 2025-05-07T20:33:49.8666870Z ) -> None: 2025-05-07T20:33:49.8667083Z torch.manual_seed(2025) 2025-05-07T20:33:49.8667326Z 2025-05-07T20:33:49.8667603Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8669794Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8671797Z 2025-05-07T20:33:49.8671919Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8672178Z 2025-05-07T20:33:49.8672280Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8672703Z self=, 2025-05-07T20:33:49.8673120Z T=2048, 2025-05-07T20:33:49.8673295Z D=5120, 2025-05-07T20:33:49.8673485Z scale_ub=1200.0, 2025-05-07T20:33:49.8673697Z contiguous=False, 2025-05-07T20:33:49.8673922Z compiled=False, 2025-05-07T20:33:49.8674119Z ) 2025-05-07T20:33:49.8674431Z self = 2025-05-07T20:33:49.8674934Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:49.8675218Z 2025-05-07T20:33:49.8675297Z @given( 2025-05-07T20:33:49.8675518Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8675830Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8676131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8676461Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8676793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8677080Z ) 2025-05-07T20:33:49.8677428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8677882Z def test_silu_mul_quant( 2025-05-07T20:33:49.8678120Z self, 2025-05-07T20:33:49.8678303Z T: int, 2025-05-07T20:33:49.8678496Z D: int, 2025-05-07T20:33:49.8678711Z scale_ub: Optional[float], 2025-05-07T20:33:49.8678978Z contiguous: bool, 2025-05-07T20:33:49.8679215Z compiled: bool, 2025-05-07T20:33:49.8679431Z ) -> None: 2025-05-07T20:33:49.8679633Z torch.manual_seed(2025) 2025-05-07T20:33:49.8679926Z 2025-05-07T20:33:49.8680249Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8682428Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8684414Z 2025-05-07T20:33:49.8684534Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8684748Z 2025-05-07T20:33:49.8684850Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8685264Z self=, 2025-05-07T20:33:49.8685679Z T=4096, 2025-05-07T20:33:49.8685858Z D=7168, 2025-05-07T20:33:49.8686087Z scale_ub=1200.0, 2025-05-07T20:33:49.8686302Z contiguous=True, 2025-05-07T20:33:49.8686517Z compiled=False, 2025-05-07T20:33:49.8686721Z ) 2025-05-07T20:33:49.9756076Z self = 2025-05-07T20:33:49.9756783Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:49.9757078Z 2025-05-07T20:33:49.9757164Z @given( 2025-05-07T20:33:49.9757398Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9757716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9758029Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9758365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9758697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9758989Z ) 2025-05-07T20:33:49.9759341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9759791Z def test_silu_mul_quant( 2025-05-07T20:33:49.9760039Z self, 2025-05-07T20:33:49.9760238Z T: int, 2025-05-07T20:33:49.9760433Z D: int, 2025-05-07T20:33:49.9760743Z scale_ub: Optional[float], 2025-05-07T20:33:49.9761014Z contiguous: bool, 2025-05-07T20:33:49.9761242Z compiled: bool, 2025-05-07T20:33:49.9761466Z ) -> None: 2025-05-07T20:33:49.9761683Z torch.manual_seed(2025) 2025-05-07T20:33:49.9761928Z 2025-05-07T20:33:49.9762197Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9764389Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9766401Z 2025-05-07T20:33:49.9766520Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9766737Z 2025-05-07T20:33:49.9766845Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9767261Z self=, 2025-05-07T20:33:49.9767677Z T=16384, 2025-05-07T20:33:49.9767872Z D=7168, 2025-05-07T20:33:49.9768056Z scale_ub=None, 2025-05-07T20:33:49.9768285Z contiguous=False, 2025-05-07T20:33:49.9768513Z compiled=True, 2025-05-07T20:33:49.9768720Z ) 2025-05-07T20:33:49.9769044Z self = 2025-05-07T20:33:49.9769604Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:49.9769960Z 2025-05-07T20:33:49.9770038Z @given( 2025-05-07T20:33:49.9770265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9770574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9770882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9771211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9771547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9771831Z ) 2025-05-07T20:33:49.9772184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9772640Z def test_silu_mul_quant( 2025-05-07T20:33:49.9772873Z self, 2025-05-07T20:33:49.9773062Z T: int, 2025-05-07T20:33:49.9773256Z D: int, 2025-05-07T20:33:49.9773466Z scale_ub: Optional[float], 2025-05-07T20:33:49.9773740Z contiguous: bool, 2025-05-07T20:33:49.9773975Z compiled: bool, 2025-05-07T20:33:49.9774187Z ) -> None: 2025-05-07T20:33:49.9774400Z torch.manual_seed(2025) 2025-05-07T20:33:49.9774814Z 2025-05-07T20:33:49.9775083Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9777265Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9779312Z 2025-05-07T20:33:49.9779432Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9779655Z 2025-05-07T20:33:49.9779758Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9780183Z self=, 2025-05-07T20:33:49.9780600Z T=4096, 2025-05-07T20:33:49.9780792Z D=7168, 2025-05-07T20:33:49.9780980Z scale_ub=None, 2025-05-07T20:33:49.9781187Z contiguous=True, 2025-05-07T20:33:49.9781454Z compiled=False, 2025-05-07T20:33:49.9781656Z ) 2025-05-07T20:33:49.9781972Z self = 2025-05-07T20:33:49.9782483Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.9782767Z 2025-05-07T20:33:49.9782847Z @given( 2025-05-07T20:33:49.9783076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9783391Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9783703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9784042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9784378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9784680Z ) 2025-05-07T20:33:49.9785037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9785487Z def test_silu_mul_quant( 2025-05-07T20:33:49.9785738Z self, 2025-05-07T20:33:49.9785939Z T: int, 2025-05-07T20:33:49.9786145Z D: int, 2025-05-07T20:33:49.9786370Z scale_ub: Optional[float], 2025-05-07T20:33:49.9786649Z contiguous: bool, 2025-05-07T20:33:49.9786892Z compiled: bool, 2025-05-07T20:33:49.9787108Z ) -> None: 2025-05-07T20:33:49.9787319Z torch.manual_seed(2025) 2025-05-07T20:33:49.9787562Z 2025-05-07T20:33:49.9787834Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9790028Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9792087Z 2025-05-07T20:33:49.9792210Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9792431Z 2025-05-07T20:33:49.9792543Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9792968Z self=, 2025-05-07T20:33:49.9793384Z T=16384, 2025-05-07T20:33:49.9793574Z D=7168, 2025-05-07T20:33:49.9793771Z scale_ub=None, 2025-05-07T20:33:49.9793988Z contiguous=True, 2025-05-07T20:33:49.9794222Z compiled=False, 2025-05-07T20:33:49.9794437Z ) 2025-05-07T20:33:49.9794758Z self = 2025-05-07T20:33:49.9795275Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.9795610Z 2025-05-07T20:33:49.9795697Z @given( 2025-05-07T20:33:49.9795930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9796257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9796575Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9796958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9797290Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9797580Z ) 2025-05-07T20:33:49.9797933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9798386Z def test_silu_mul_quant( 2025-05-07T20:33:49.9798630Z self, 2025-05-07T20:33:49.9798821Z T: int, 2025-05-07T20:33:49.9799015Z D: int, 2025-05-07T20:33:49.9799231Z scale_ub: Optional[float], 2025-05-07T20:33:49.9799535Z contiguous: bool, 2025-05-07T20:33:49.9799797Z compiled: bool, 2025-05-07T20:33:49.9800012Z ) -> None: 2025-05-07T20:33:49.9800222Z torch.manual_seed(2025) 2025-05-07T20:33:49.9800454Z 2025-05-07T20:33:49.9800721Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9802944Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9804944Z 2025-05-07T20:33:49.9805063Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9805282Z 2025-05-07T20:33:49.9805388Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9805809Z self=, 2025-05-07T20:33:49.9806228Z T=16384, 2025-05-07T20:33:49.9806417Z D=7168, 2025-05-07T20:33:49.9806599Z scale_ub=1200.0, 2025-05-07T20:33:49.9806818Z contiguous=True, 2025-05-07T20:33:49.9807032Z compiled=False, 2025-05-07T20:33:49.9807226Z ) 2025-05-07T20:33:49.9807540Z self = 2025-05-07T20:33:49.9808051Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:49.9808338Z 2025-05-07T20:33:49.9808420Z @given( 2025-05-07T20:33:49.9808638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9808951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9809257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9809583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9809957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9810244Z ) 2025-05-07T20:33:49.9810582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9811039Z def test_silu_mul_quant( 2025-05-07T20:33:49.9811272Z self, 2025-05-07T20:33:49.9811461Z T: int, 2025-05-07T20:33:49.9811645Z D: int, 2025-05-07T20:33:49.9811863Z scale_ub: Optional[float], 2025-05-07T20:33:49.9812130Z contiguous: bool, 2025-05-07T20:33:49.9812360Z compiled: bool, 2025-05-07T20:33:49.9812578Z ) -> None: 2025-05-07T20:33:49.9812786Z torch.manual_seed(2025) 2025-05-07T20:33:49.9813021Z 2025-05-07T20:33:49.9813286Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9815597Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
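These examples fail one after another with the same ~26 MiB free, which suggests memory from earlier Hypothesis examples is still held when the next one starts. A hedged sketch of a cleanup hook between examples (the setUp placement and class excerpt are assumptions; the real test class may handle this differently):

    import gc
    import unittest

    import torch

    class ActivationTests(unittest.TestCase):  # hypothetical excerpt
        def setUp(self) -> None:
            gc.collect()               # drop tensors kept alive only by Python refs
            torch.cuda.empty_cache()   # return cached, unused blocks to the driver

This helps only if the earlier tensors are actually unreachable; references captured by closures (fn, ref_fn) or cached by torch.compile would keep their memory pinned regardless.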
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9817630Z 2025-05-07T20:33:49.9817752Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9817969Z 2025-05-07T20:33:49.9818073Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9818490Z self=, 2025-05-07T20:33:49.9818910Z T=128, 2025-05-07T20:33:49.9819094Z D=5120, 2025-05-07T20:33:49.9819277Z scale_ub=1200.0, 2025-05-07T20:33:49.9819499Z contiguous=False, 2025-05-07T20:33:49.9819718Z compiled=False, 2025-05-07T20:33:49.9819913Z ) 2025-05-07T20:33:50.1122910Z self = 2025-05-07T20:33:50.1124039Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.1124619Z 2025-05-07T20:33:50.1124787Z @given( 2025-05-07T20:33:50.1125746Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1126420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1127027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1127703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1128368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1128951Z ) 2025-05-07T20:33:50.1129643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1130286Z def test_silu_mul_quant( 2025-05-07T20:33:50.1130541Z self, 2025-05-07T20:33:50.1130733Z T: int, 2025-05-07T20:33:50.1130935Z D: int, 2025-05-07T20:33:50.1131159Z scale_ub: Optional[float], 2025-05-07T20:33:50.1131438Z contiguous: bool, 2025-05-07T20:33:50.1131690Z compiled: bool, 2025-05-07T20:33:50.1131917Z ) -> None: 2025-05-07T20:33:50.1132133Z torch.manual_seed(2025) 2025-05-07T20:33:50.1132377Z 2025-05-07T20:33:50.1132664Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1133017Z 2025-05-07T20:33:50.1133215Z x_sign = torch.sign(x) 2025-05-07T20:33:50.1133514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.1133826Z x = x_sign * x_clamp 2025-05-07T20:33:50.1134069Z x0 = x[:, :D] 2025-05-07T20:33:50.1134290Z x1 = x[:, D:] 2025-05-07T20:33:50.1134557Z 2025-05-07T20:33:50.1134744Z if contiguous: 2025-05-07T20:33:50.1134983Z x0 = x0.contiguous() 2025-05-07T20:33:50.1135259Z x1 = x1.contiguous() 2025-05-07T20:33:50.1135504Z 2025-05-07T20:33:50.1135706Z if scale_ub is not None: 2025-05-07T20:33:50.1136084Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.1136434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.1136755Z ) 2025-05-07T20:33:50.1136962Z else: 2025-05-07T20:33:50.1137182Z scale_ub_tensor = None 2025-05-07T20:33:50.1137446Z 2025-05-07T20:33:50.1137698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.1138025Z op = silu_mul_quant 2025-05-07T20:33:50.1138285Z if compiled: 2025-05-07T20:33:50.1138558Z op = torch.compile(op) 2025-05-07T20:33:50.1138860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.1139158Z 2025-05-07T20:33:50.1139366Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.1139539Z 2025-05-07T20:33:50.1139664Z moe/activation_test.py:117: 2025-05-07T20:33:50.1139972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.1140328Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.1140631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.1141438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.1142192Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.1142770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.1143577Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.1157861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.1158615Z kernel = self.compile( 2025-05-07T20:33:50.1159205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.1159988Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.1160428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.1160669Z 2025-05-07T20:33:50.1160893Z self = 2025-05-07T20:33:50.1162140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.1163626Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1c9487c0>} 2025-05-07T20:33:50.1165047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.1166144Z context = 2025-05-07T20:33:50.1166451Z 2025-05-07T20:33:50.1166637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.1167185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.1167692Z module_map=module_map) 2025-05-07T20:33:50.1168090Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.1168462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.1168746Z E ^ 2025-05-07T20:33:50.1169244Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.1169737Z 2025-05-07T20:33:50.1170220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.1170768Z 2025-05-07T20:33:50.1170882Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1171321Z self=, 2025-05-07T20:33:50.1171807Z T=2048, 2025-05-07T20:33:50.1172011Z D=7168, 2025-05-07T20:33:50.1172226Z scale_ub=None, 2025-05-07T20:33:50.1172464Z contiguous=False, 2025-05-07T20:33:50.1172702Z compiled=False, 2025-05-07T20:33:50.1172935Z ) 2025-05-07T20:33:50.1173278Z self = 2025-05-07T20:33:50.1173797Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.1174094Z 2025-05-07T20:33:50.1174180Z @given( 2025-05-07T20:33:50.1174430Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1174863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1175188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1175544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1175944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1176285Z ) 2025-05-07T20:33:50.1176858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1181434Z def test_silu_mul_quant( 2025-05-07T20:33:50.1181684Z self, 2025-05-07T20:33:50.1181891Z T: int, 2025-05-07T20:33:50.1182103Z D: int, 2025-05-07T20:33:50.1182345Z scale_ub: Optional[float], 2025-05-07T20:33:50.1182690Z contiguous: bool, 2025-05-07T20:33:50.1182938Z compiled: bool, 2025-05-07T20:33:50.1183168Z ) -> None: 2025-05-07T20:33:50.1183388Z torch.manual_seed(2025) 2025-05-07T20:33:50.1183636Z 2025-05-07T20:33:50.1183916Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1186161Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
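The CompilationError interleaved above is a different root cause from the OOMs: Triton refuses to lower fp8e4nv (the e4m3 format) on this GPU. A g5.4xlarge carries an A10G (sm_86), and the ValueError lists fp8e4b15 and fp8e5 as the only fp8 types available there; fp8e4nv generally requires a newer architecture (sm_89+, to my understanding -- an assumption, not stated in the log). A hedged capability guard for such tests:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:  # hypothetical helper; 8.9 threshold is an assumption
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this architecture")
    def test_silu_mul_quant(self) -> None:
        ...

With such a guard the run would report a skip here instead of a CompilationError.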
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.1188161Z 2025-05-07T20:33:50.1188289Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.1188508Z 2025-05-07T20:33:50.1188617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1209714Z self=, 2025-05-07T20:33:50.1210152Z T=128, 2025-05-07T20:33:50.1210333Z D=7168, 2025-05-07T20:33:50.1210521Z scale_ub=1200.0, 2025-05-07T20:33:50.1210738Z contiguous=True, 2025-05-07T20:33:50.1210955Z compiled=True, 2025-05-07T20:33:50.1211152Z ) 2025-05-07T20:33:50.1484949Z self = 2025-05-07T20:33:50.1485470Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.1485772Z 2025-05-07T20:33:50.1485857Z @given( 2025-05-07T20:33:50.1486097Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1486516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1486902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1487241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1487572Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1487861Z ) 2025-05-07T20:33:50.1488205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1488653Z def test_silu_mul_quant( 2025-05-07T20:33:50.1488888Z self, 2025-05-07T20:33:50.1489076Z T: int, 2025-05-07T20:33:50.1489261Z D: int, 2025-05-07T20:33:50.1489469Z scale_ub: Optional[float], 2025-05-07T20:33:50.1489785Z contiguous: bool, 2025-05-07T20:33:50.1490019Z compiled: bool, 2025-05-07T20:33:50.1490373Z ) -> None: 2025-05-07T20:33:50.1490588Z torch.manual_seed(2025) 2025-05-07T20:33:50.1490836Z 2025-05-07T20:33:50.1491114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1491460Z 2025-05-07T20:33:50.1491657Z x_sign = torch.sign(x) 2025-05-07T20:33:50.1491961Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.1492273Z x = x_sign * x_clamp 2025-05-07T20:33:50.1492521Z x0 = x[:, :D] 2025-05-07T20:33:50.1492744Z x1 = x[:, D:] 2025-05-07T20:33:50.1492954Z 2025-05-07T20:33:50.1493138Z if contiguous: 2025-05-07T20:33:50.1493378Z x0 = x0.contiguous() 2025-05-07T20:33:50.1493646Z x1 = x1.contiguous() 2025-05-07T20:33:50.1493895Z 2025-05-07T20:33:50.1494096Z if scale_ub is not None: 2025-05-07T20:33:50.1494383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.1494824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.1495148Z ) 2025-05-07T20:33:50.1495424Z else: 2025-05-07T20:33:50.1495635Z scale_ub_tensor = None 2025-05-07T20:33:50.1495896Z 2025-05-07T20:33:50.1496135Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.1496456Z op = silu_mul_quant 2025-05-07T20:33:50.1496788Z if compiled: 2025-05-07T20:33:50.1497049Z op = torch.compile(op) 2025-05-07T20:33:50.1497354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.1497649Z 2025-05-07T20:33:50.1497851Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.1498021Z 2025-05-07T20:33:50.1498131Z moe/activation_test.py:117: 2025-05-07T20:33:50.1498433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.1498785Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.1499088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.1499673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.1500277Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.1501056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.1501796Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.1502364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.1503091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.1503802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.1504366Z kernel = self.compile( 2025-05-07T20:33:50.1504941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.1505650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.1506085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.1506324Z 2025-05-07T20:33:50.1506543Z self = 2025-05-07T20:33:50.1507678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.1509120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1c949940>} 2025-05-07T20:33:50.1510541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.1511682Z context = 2025-05-07T20:33:50.1511998Z 2025-05-07T20:33:50.1512172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.1512723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.1513219Z module_map=module_map) 2025-05-07T20:33:50.1513593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.1513963Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.1514239Z E ^ 2025-05-07T20:33:50.1514726Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.1515213Z 2025-05-07T20:33:50.1515653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.1516207Z 2025-05-07T20:33:50.1516318Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1516805Z self=, 2025-05-07T20:33:50.1517221Z T=128, 2025-05-07T20:33:50.1517427Z D=7168, 2025-05-07T20:33:50.1517623Z scale_ub=1200.0, 2025-05-07T20:33:50.1517843Z contiguous=True, 2025-05-07T20:33:50.1518078Z compiled=False, 2025-05-07T20:33:50.1518330Z ) 2025-05-07T20:33:50.1518659Z self = 2025-05-07T20:33:50.1519174Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.1519466Z 2025-05-07T20:33:50.1519551Z @given( 2025-05-07T20:33:50.1519786Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1520101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1520424Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1520763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1521098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1521395Z ) 2025-05-07T20:33:50.1521751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1522211Z def test_silu_mul_quant( 2025-05-07T20:33:50.1522446Z self, 2025-05-07T20:33:50.1522697Z T: int, 2025-05-07T20:33:50.1522902Z D: int, 2025-05-07T20:33:50.1523125Z scale_ub: Optional[float], 2025-05-07T20:33:50.1523401Z contiguous: bool, 2025-05-07T20:33:50.1523645Z compiled: bool, 2025-05-07T20:33:50.1523863Z ) -> None: 2025-05-07T20:33:50.1524089Z torch.manual_seed(2025) 2025-05-07T20:33:50.1524339Z 2025-05-07T20:33:50.1524613Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1524975Z 2025-05-07T20:33:50.1525172Z x_sign = torch.sign(x) 2025-05-07T20:33:50.1525704Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.1527880Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
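Note where this example died: not at the initial randn on test line 92 but at the clamp on line 95, with only 4.44 MiB free. The straightforward formulation materializes the abs, clamp, and product results as separate full-size tensors before x is rebound. A sketch of an equivalent computation that reuses x's storage (an illustration, not a change present in the repo):

    import torch

    x = torch.randn([128, 2 * 7168], device="cuda", dtype=torch.bfloat16)
    x_sign = torch.sign(x)                        # the one remaining temporary
    x = x.abs_().clamp_(0.01, 2.0).mul_(x_sign)   # same as sign(x) * clamp(abs(x), 0.01, 2.0)

That avoids the abs/clamp/product temporaries per example, though it would not rescue a device that is already 22.06 GiB full.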
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.1529879Z 2025-05-07T20:33:50.1529997Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.1530222Z 2025-05-07T20:33:50.1530325Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1530750Z self=, 2025-05-07T20:33:50.1531160Z T=128, 2025-05-07T20:33:50.1531354Z D=5120, 2025-05-07T20:33:50.1531552Z scale_ub=1200.0, 2025-05-07T20:33:50.1531774Z contiguous=True, 2025-05-07T20:33:50.1532088Z compiled=True, 2025-05-07T20:33:50.1532290Z ) 2025-05-07T20:33:50.1532613Z self = 2025-05-07T20:33:50.1533124Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.1533415Z 2025-05-07T20:33:50.1533494Z @given( 2025-05-07T20:33:50.1533736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1534047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1534360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1534748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1535078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1535372Z ) 2025-05-07T20:33:50.1535824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1536298Z def test_silu_mul_quant( 2025-05-07T20:33:50.1536544Z self, 2025-05-07T20:33:50.1536760Z T: int, 2025-05-07T20:33:50.1536952Z D: int, 2025-05-07T20:33:50.1537285Z scale_ub: Optional[float], 2025-05-07T20:33:50.1537566Z contiguous: bool, 2025-05-07T20:33:50.1537803Z compiled: bool, 2025-05-07T20:33:50.1538035Z ) -> None: 2025-05-07T20:33:50.1538254Z torch.manual_seed(2025) 2025-05-07T20:33:50.1538567Z 2025-05-07T20:33:50.1538839Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1539193Z 2025-05-07T20:33:50.1539395Z x_sign = torch.sign(x) 2025-05-07T20:33:50.1539685Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.1541815Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.1543798Z 2025-05-07T20:33:50.1543978Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.1544205Z 2025-05-07T20:33:50.1544319Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1544749Z self=, 2025-05-07T20:33:50.1545162Z T=128, 2025-05-07T20:33:50.1545364Z D=7168, 2025-05-07T20:33:50.1545567Z scale_ub=None, 2025-05-07T20:33:50.1545775Z contiguous=True, 2025-05-07T20:33:50.1546004Z compiled=True, 2025-05-07T20:33:50.1546212Z ) 2025-05-07T20:33:50.4242448Z self = 2025-05-07T20:33:50.4242983Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.4243273Z 2025-05-07T20:33:50.4243352Z @given( 2025-05-07T20:33:50.4245003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4245317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4245632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4245973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4246304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4246598Z ) 2025-05-07T20:33:50.4246953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4247410Z def test_silu_mul_quant( 2025-05-07T20:33:50.4247647Z self, 2025-05-07T20:33:50.4247841Z T: int, 2025-05-07T20:33:50.4248035Z D: int, 2025-05-07T20:33:50.4248243Z scale_ub: Optional[float], 2025-05-07T20:33:50.4248514Z contiguous: bool, 2025-05-07T20:33:50.4248753Z compiled: bool, 2025-05-07T20:33:50.4248966Z ) -> None: 2025-05-07T20:33:50.4249294Z torch.manual_seed(2025) 2025-05-07T20:33:50.4249535Z 2025-05-07T20:33:50.4249814Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4252053Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.4254051Z 2025-05-07T20:33:50.4254169Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.4254391Z 2025-05-07T20:33:50.4269993Z FAILED 2025-05-07T20:33:50.4270150Z 2025-05-07T20:33:50.4270294Z =================================== FAILURES =================================== 2025-05-07T20:33:50.4271117Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:50.4271687Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:50.4272357Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:50.4273019Z | yield 2025-05-07T20:33:50.4273480Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:50.4274105Z | self._callTestMethod(testMethod) 2025-05-07T20:33:50.4274771Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:50.4275459Z | if method() is not None: 2025-05-07T20:33:50.4275754Z | ^^^^^^^^ 2025-05-07T20:33:50.4276501Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:50.4277552Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4277978Z | ^^^^^^^ 2025-05-07T20:33:50.4278857Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:50.4279763Z | raise the_error_hypothesis_found 2025-05-07T20:33:50.4280414Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:50.4281021Z +-+---------------- 1 ---------------- 2025-05-07T20:33:50.4281421Z | Traceback (most recent call last): 2025-05-07T20:33:50.4282428Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:50.4283532Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4284055Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4286944Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.4289792Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:50.4290414Z | self=, 2025-05-07T20:33:50.4290985Z | T=2048, 2025-05-07T20:33:50.4291299Z | D=5120, # or any other generated value 2025-05-07T20:33:50.4291779Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:50.4292380Z | contiguous=True, # or any other generated value 2025-05-07T20:33:50.4292901Z | compiled=False, # or any other generated value 2025-05-07T20:33:50.4293326Z | ) 2025-05-07T20:33:50.4293582Z | 2025-05-07T20:33:50.4294326Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:50.4295319Z +---------------- 2 ---------------- 2025-05-07T20:33:50.4295730Z | Traceback (most recent call last): 2025-05-07T20:33:50.4296763Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:50.4297889Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4298410Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4301381Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.4304282Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:50.4304879Z | self=, 2025-05-07T20:33:50.4305296Z | T=128, 2025-05-07T20:33:50.4305496Z | D=7168, 2025-05-07T20:33:50.4305706Z | scale_ub=None, 2025-05-07T20:33:50.4305946Z | contiguous=True, 2025-05-07T20:33:50.4306182Z | compiled=True, 2025-05-07T20:33:50.4306410Z | ) 2025-05-07T20:33:50.4306591Z | 2025-05-07T20:33:50.4307127Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:50.4307814Z +---------------- 3 ---------------- 2025-05-07T20:33:50.4308121Z | Traceback (most recent call last): 2025-05-07T20:33:50.4308860Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:50.4309678Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4310069Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4312290Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
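As the Hypothesis output says, each falsifying example can be replayed exactly. A sketch of the suggested decorator applied to this test, using the blob printed for failure 1 (the decorator stacking shown is an assumption; Hypothesis only requires it to sit on the test case, and it should be removed again after debugging):

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')   # blob from failure 1 above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body

This replays only T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False instead of re-sampling the whole grid.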
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.4314404Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:50.4314853Z | self=, 2025-05-07T20:33:50.4315277Z | T=128, 2025-05-07T20:33:50.4315478Z | D=5120, 2025-05-07T20:33:50.4315683Z | scale_ub=1200.0, 2025-05-07T20:33:50.4315927Z | contiguous=True, 2025-05-07T20:33:50.4316170Z | compiled=True, 2025-05-07T20:33:50.4316395Z | ) 2025-05-07T20:33:50.4316584Z | 2025-05-07T20:33:50.4317128Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:50.4317821Z +---------------- 4 ---------------- 2025-05-07T20:33:50.4318117Z | Traceback (most recent call last): 2025-05-07T20:33:50.4318869Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:50.4319623Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:50.4319906Z | ^^^^^^^^ 2025-05-07T20:33:50.4320728Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:50.4321767Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4322256Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4323425Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:50.4324751Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:50.4325938Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:50.4327011Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.4327827Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4328786Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:50.4352232Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:50.4353052Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4353995Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:50.4355027Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:50.4355573Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4356651Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:50.4358248Z | fn() 2025-05-07T20:33:50.4359075Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:50.4360338Z | self.fn.run( 2025-05-07T20:33:50.4361096Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:50.4361947Z | kernel = self.compile( 2025-05-07T20:33:50.4362322Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:50.4363171Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:50.4364208Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.4364768Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4365708Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:50.4366858Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.4367545Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4368082Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.4368579Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:50.4368967Z | ^ 2025-05-07T20:33:50.4369638Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4370620Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:50.4371177Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:50.4371902Z | self=, 2025-05-07T20:33:50.4372517Z | T=1, # or any other generated value 2025-05-07T20:33:50.4372949Z | D=5120, # or any other generated value 2025-05-07T20:33:50.4373442Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:50.4373989Z | contiguous=True, # or any other generated value 2025-05-07T20:33:50.4374671Z | compiled=True, # or any other generated value 2025-05-07T20:33:50.4375107Z | ) 2025-05-07T20:33:50.4375362Z | 2025-05-07T20:33:50.4376119Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:50.4377007Z +------------------------------------ 2025-05-07T20:33:50.4377618Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:50.4378163Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4378748Z self=, 2025-05-07T20:33:50.4379400Z T=1, 2025-05-07T20:33:50.4379658Z D=5120, 2025-05-07T20:33:50.4379915Z scale_ub=None, 2025-05-07T20:33:50.4380210Z contiguous=True, 2025-05-07T20:33:50.4380513Z compiled=True, 2025-05-07T20:33:50.4380796Z ) 2025-05-07T20:33:50.4381225Z self = 2025-05-07T20:33:50.4381881Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.4382235Z 2025-05-07T20:33:50.4382355Z @given( 2025-05-07T20:33:50.4382665Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4383111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4383548Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4384017Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4384498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4384919Z ) 2025-05-07T20:33:50.4385468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4386125Z def test_silu_mul_quant( 2025-05-07T20:33:50.4386467Z self, 2025-05-07T20:33:50.4386738Z T: int, 2025-05-07T20:33:50.4387002Z D: int, 2025-05-07T20:33:50.4387301Z scale_ub: Optional[float], 2025-05-07T20:33:50.4387679Z contiguous: bool, 2025-05-07T20:33:50.4388014Z compiled: bool, 2025-05-07T20:33:50.4388340Z ) -> None: 2025-05-07T20:33:50.4388646Z torch.manual_seed(2025) 2025-05-07T20:33:50.4388978Z 2025-05-07T20:33:50.4389348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4389846Z 2025-05-07T20:33:50.4390123Z x_sign = torch.sign(x) 2025-05-07T20:33:50.4390548Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.4390999Z x = x_sign * x_clamp 2025-05-07T20:33:50.4391330Z x0 = x[:, :D] 2025-05-07T20:33:50.4391636Z x1 = x[:, D:] 2025-05-07T20:33:50.4391926Z 2025-05-07T20:33:50.4392189Z if contiguous: 2025-05-07T20:33:50.4392508Z x0 = x0.contiguous() 2025-05-07T20:33:50.4392866Z x1 = x1.contiguous() 2025-05-07T20:33:50.4393198Z 2025-05-07T20:33:50.4393464Z if scale_ub is not None: 2025-05-07T20:33:50.4393854Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.4394314Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.4394711Z ) 2025-05-07T20:33:50.4394973Z else: 2025-05-07T20:33:50.4395260Z scale_ub_tensor = None 2025-05-07T20:33:50.4395587Z 2025-05-07T20:33:50.4395892Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4396378Z op = silu_mul_quant 2025-05-07T20:33:50.4396708Z if compiled: 2025-05-07T20:33:50.4397037Z op = torch.compile(op) 2025-05-07T20:33:50.4397429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4397795Z 2025-05-07T20:33:50.4398057Z y_fp8, y_scale = fn() 2025-05-07T20:33:50.4398436Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:50.4398843Z 2025-05-07T20:33:50.4399184Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4399642Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:50.4400041Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:50.4400459Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:50.4400943Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4401368Z 2025-05-07T20:33:50.4401637Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:50.4401921Z 2025-05-07T20:33:50.4402058Z moe/activation_test.py:126: 2025-05-07T20:33:50.4402508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4402952Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:50.4403396Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4404571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:50.4405696Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:50.4406489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.4407446Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.4408440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:50.4409528Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:50.4410576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:50.4411525Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:50.4412352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:50.4413060Z fn() 2025-05-07T20:33:50.4413761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:50.4414663Z self.fn.run( 2025-05-07T20:33:50.4415331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.4416060Z kernel = self.compile( 2025-05-07T20:33:50.4416828Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.4417786Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.4418371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4418725Z 2025-05-07T20:33:50.4419022Z self = 2025-05-07T20:33:50.4420661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.4422698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0b065c60>} 2025-05-07T20:33:50.4424678Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.4426581Z context = 2025-05-07T20:33:50.4427011Z 2025-05-07T20:33:50.4427246Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.4428003Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.4428630Z module_map=module_map) 2025-05-07T20:33:50.4429106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.4429604Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:50.4429975Z E ^ 2025-05-07T20:33:50.4430614Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4431247Z 2025-05-07T20:33:50.4431812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.4432563Z 2025-05-07T20:33:50.4432707Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4433386Z self=, 2025-05-07T20:33:50.4433952Z T=2048, 2025-05-07T20:33:50.4434218Z D=5120, 2025-05-07T20:33:50.4434496Z scale_ub=1200.0, 2025-05-07T20:33:50.4434808Z contiguous=True, 2025-05-07T20:33:50.4435221Z compiled=False, 2025-05-07T20:33:50.4435518Z ) 2025-05-07T20:33:50.4435964Z self = 2025-05-07T20:33:50.4436678Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.4437086Z 2025-05-07T20:33:50.4437199Z @given( 2025-05-07T20:33:50.4437529Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4437964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4438401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4438882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4439370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4439798Z ) 2025-05-07T20:33:50.4440308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4441031Z def test_silu_mul_quant( 2025-05-07T20:33:50.4441387Z self, 2025-05-07T20:33:50.4441675Z T: int, 2025-05-07T20:33:50.4441956Z D: int, 2025-05-07T20:33:50.4442287Z scale_ub: Optional[float], 2025-05-07T20:33:50.4442686Z contiguous: bool, 2025-05-07T20:33:50.4443024Z compiled: bool, 2025-05-07T20:33:50.4443351Z ) -> None: 2025-05-07T20:33:50.4443670Z torch.manual_seed(2025) 2025-05-07T20:33:50.4444041Z 2025-05-07T20:33:50.4444422Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4444929Z 2025-05-07T20:33:50.4445212Z x_sign = torch.sign(x) 2025-05-07T20:33:50.4445621Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.4446079Z x = x_sign * x_clamp 2025-05-07T20:33:50.4446430Z x0 = x[:, :D] 
2025-05-07T20:33:50.4446737Z x1 = x[:, D:] 2025-05-07T20:33:50.4447046Z 2025-05-07T20:33:50.4447319Z if contiguous: 2025-05-07T20:33:50.4447655Z x0 = x0.contiguous() 2025-05-07T20:33:50.4448041Z x1 = x1.contiguous() 2025-05-07T20:33:50.4448403Z 2025-05-07T20:33:50.4448678Z if scale_ub is not None: 2025-05-07T20:33:50.4449073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.4449564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.4450008Z ) 2025-05-07T20:33:50.4450296Z else: 2025-05-07T20:33:50.4451048Z scale_ub_tensor = None 2025-05-07T20:33:50.4451408Z 2025-05-07T20:33:50.4451733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4452176Z op = silu_mul_quant 2025-05-07T20:33:50.4452515Z if compiled: 2025-05-07T20:33:50.4452957Z op = torch.compile(op) 2025-05-07T20:33:50.4453371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4453767Z 2025-05-07T20:33:50.4454022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.4454241Z 2025-05-07T20:33:50.4454385Z moe/activation_test.py:117: 2025-05-07T20:33:50.4454901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4455360Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.4455752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4456728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.4457687Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.4458412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.4459346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.4460398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.4461700Z kernel = self.compile( 2025-05-07T20:33:50.4462695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.4463703Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.4464274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4464610Z 2025-05-07T20:33:50.4464897Z self = 2025-05-07T20:33:50.4466446Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.4468459Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0aebc220>} 2025-05-07T20:33:50.4470580Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.4472087Z context = 2025-05-07T20:33:50.4472504Z 2025-05-07T20:33:50.4472747Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.4473500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.4474155Z module_map=module_map) 2025-05-07T20:33:50.4474658Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.4475161Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.4475510Z E ^ 2025-05-07T20:33:50.4476164Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4476822Z 2025-05-07T20:33:50.4477432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.4478149Z 2025-05-07T20:33:50.4478298Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4478841Z self=, 2025-05-07T20:33:50.4479384Z T=2048, 2025-05-07T20:33:50.4479635Z D=5120, 2025-05-07T20:33:50.4479887Z scale_ub=1200.0, 2025-05-07T20:33:50.4480199Z contiguous=True, 2025-05-07T20:33:50.4480494Z compiled=True, 2025-05-07T20:33:50.4480761Z ) 2025-05-07T20:33:50.4481182Z self = 2025-05-07T20:33:50.4481838Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.4482295Z 2025-05-07T20:33:50.4482424Z @given( 2025-05-07T20:33:50.4482756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4483218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4483674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4484130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4484577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4484961Z ) 2025-05-07T20:33:50.4485419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4486023Z def test_silu_mul_quant( 2025-05-07T20:33:50.4486363Z self, 2025-05-07T20:33:50.4486623Z T: int, 2025-05-07T20:33:50.4486884Z D: int, 2025-05-07T20:33:50.4487172Z scale_ub: Optional[float], 2025-05-07T20:33:50.4487526Z contiguous: bool, 2025-05-07T20:33:50.4487848Z compiled: bool, 2025-05-07T20:33:50.4488142Z ) -> None: 2025-05-07T20:33:50.4488414Z torch.manual_seed(2025) 2025-05-07T20:33:50.4488744Z 2025-05-07T20:33:50.4489166Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4489652Z 2025-05-07T20:33:50.4489939Z x_sign = torch.sign(x) 2025-05-07T20:33:50.4490337Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.4490807Z x = x_sign * x_clamp 2025-05-07T20:33:50.4491118Z x0 = x[:, :D] 2025-05-07T20:33:50.4491410Z x1 = x[:, D:] 2025-05-07T20:33:50.4491688Z 2025-05-07T20:33:50.4491932Z if contiguous: 2025-05-07T20:33:50.4492253Z x0 = x0.contiguous() 2025-05-07T20:33:50.4492607Z x1 = x1.contiguous() 2025-05-07T20:33:50.4492916Z 2025-05-07T20:33:50.4493173Z if scale_ub is not None: 2025-05-07T20:33:50.4493543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.4493988Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.4494405Z ) 2025-05-07T20:33:50.4494751Z else: 2025-05-07T20:33:50.4495025Z scale_ub_tensor = None 2025-05-07T20:33:50.4495372Z 2025-05-07T20:33:50.4495683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4496107Z op = silu_mul_quant 2025-05-07T20:33:50.4496489Z if compiled: 2025-05-07T20:33:50.4496822Z op = torch.compile(op) 2025-05-07T20:33:50.4497225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4497597Z 2025-05-07T20:33:50.4497856Z y_fp8, y_scale = fn() 2025-05-07T20:33:50.4498235Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:50.4498619Z 2025-05-07T20:33:50.4498936Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4499388Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:50.4499772Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:50.4500259Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:50.4500777Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4501250Z 2025-05-07T20:33:50.4501540Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:50.4501847Z 2025-05-07T20:33:50.4501994Z moe/activation_test.py:126: 2025-05-07T20:33:50.4502443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4502934Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:50.4503403Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4504556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:50.4505664Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:50.4506462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.4507454Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.4508503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:50.4509551Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:50.4510611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:50.4511542Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:50.4512420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:50.4513159Z fn() 2025-05-07T20:33:50.4513875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:50.4514761Z self.fn.run( 2025-05-07T20:33:50.4515446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.4516263Z kernel = self.compile( 2025-05-07T20:33:50.4517102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.4518054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.4518613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4519003Z 2025-05-07T20:33:50.4519294Z self = 2025-05-07T20:33:50.4520902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.4522904Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0aebd8a0>} 2025-05-07T20:33:50.4524865Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.4526752Z context = 2025-05-07T20:33:50.4527161Z 2025-05-07T20:33:50.4527408Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.4528202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.4528902Z module_map=module_map) 2025-05-07T20:33:50.4529436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.4529946Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:50.4530328Z E ^ 2025-05-07T20:33:50.4530982Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4531652Z 2025-05-07T20:33:50.4532260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.4532999Z 2025-05-07T20:33:50.4533151Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4533727Z self=, 2025-05-07T20:33:50.4534289Z T=16384, 2025-05-07T20:33:50.4534638Z D=7168, 2025-05-07T20:33:50.4534910Z scale_ub=1200.0, 2025-05-07T20:33:50.4535221Z contiguous=False, 2025-05-07T20:33:50.4535547Z compiled=False, 2025-05-07T20:33:50.4535845Z ) 2025-05-07T20:33:50.4536291Z self = 2025-05-07T20:33:50.4537004Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.4537402Z 2025-05-07T20:33:50.4537515Z @given( 2025-05-07T20:33:50.4537821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4538354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4538791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4539243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4539700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4540155Z ) 2025-05-07T20:33:50.4540642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4541276Z def test_silu_mul_quant( 2025-05-07T20:33:50.4541633Z self, 2025-05-07T20:33:50.4541920Z T: int, 2025-05-07T20:33:50.4542201Z D: int, 2025-05-07T20:33:50.4542520Z scale_ub: Optional[float], 2025-05-07T20:33:50.4542913Z contiguous: bool, 2025-05-07T20:33:50.4543252Z compiled: bool, 2025-05-07T20:33:50.4543564Z ) -> None: 2025-05-07T20:33:50.4543865Z torch.manual_seed(2025) 2025-05-07T20:33:50.4544198Z 2025-05-07T20:33:50.4544571Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4545061Z 2025-05-07T20:33:50.4545405Z x_sign = torch.sign(x) 2025-05-07T20:33:50.4545805Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.4546236Z x = x_sign * x_clamp 2025-05-07T20:33:50.4546576Z x0 = x[:, :D] 2025-05-07T20:33:50.4546876Z x1 = x[:, D:] 2025-05-07T20:33:50.4547250Z 2025-05-07T20:33:50.4547510Z if contiguous: 2025-05-07T20:33:50.4547827Z x0 = x0.contiguous() 2025-05-07T20:33:50.4548189Z x1 = x1.contiguous() 2025-05-07T20:33:50.4548529Z 2025-05-07T20:33:50.4548791Z if scale_ub is not None: 2025-05-07T20:33:50.4549175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.4549640Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.4550062Z ) 2025-05-07T20:33:50.4550330Z else: 2025-05-07T20:33:50.4550615Z scale_ub_tensor = None 2025-05-07T20:33:50.4550960Z 2025-05-07T20:33:50.4551282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4551721Z op = silu_mul_quant 2025-05-07T20:33:50.4552060Z if compiled: 2025-05-07T20:33:50.4552410Z op = torch.compile(op) 2025-05-07T20:33:50.4552881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4553260Z 2025-05-07T20:33:50.4553531Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.4553767Z 2025-05-07T20:33:50.4553905Z moe/activation_test.py:117: 2025-05-07T20:33:50.4554320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4554801Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.4555212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4556229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1c09d487c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
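For orientation, the computation this test verifies is small enough to state in plain PyTorch. The sketch below mirrors the intent of the silu_mul_quant / triton_quantize_fp8_row pair used above (a SiLU-gated multiply followed by row-wise FP8 quantization); it is an approximation for illustration, not FBGEMM's implementation, the clamping details are assumptions, and it needs a PyTorch build with torch.float8_e4m3fn:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # y = silu(x0) * x1, computed in fp32 as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Row-wise dequantization scale: per-row max |y| (optionally capped by
    # scale_ub) divided by the largest representable E4M3 value.
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    # The test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None].
    return y_fp8, scale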
Hypothesis then tried nine more examples; every one raised the same triton.compiler.errors.CompilationError at compiler.py:100 ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The per-example tracebacks are identical to the one above, so only the tried parameters and the failing call are kept here:

  T      D     scale_ub  contiguous  compiled  fails at            kernel that failed to compile
  -----  ----  --------  ----------  --------  ------------------  -----------------------------
  1      7168  None      True        True      ref_fn()  line 126  _kernel_quantize_fp8_row
  4096   5120  None      False       False     fn()      line 117  _fbgemm_silu_mul_quant
  4096   7168  None      False       False     fn()      line 117  _fbgemm_silu_mul_quant
  128    7168  None      False       True      ref_fn()  line 126  _kernel_quantize_fp8_row
  128    7168  None      False       False     fn()      line 117  _fbgemm_silu_mul_quant
  4096   5120  1200.0    True        False     fn()      line 117  _fbgemm_silu_mul_quant
  1      5120  None      True        True      ref_fn()  line 126  _kernel_quantize_fp8_row
  2048   5120  None      True        True      ref_fn()  line 126  _kernel_quantize_fp8_row
  128    5120  None      True        True      ref_fn()  line 126  _kernel_quantize_fp8_row

In this log the failure point tracks the compiled flag: with compiled=False the error surfaces in fn() at the _fbgemm_silu_mul_quant launch, while with compiled=True it surfaces later, in ref_fn()'s _kernel_quantize_fp8_row launch.
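To chase one of these cases without the Hypothesis search loop, the failing call can be driven directly. A hypothetical repro using the same names as the test above (the import path is inferred from the traceback; it must run on an fp8e4nv-capable GPU, i.e. SM 8.9+, to get past this CompilationError):

import torch

# Import path inferred from the traceback
# (.../fbgemm_gpu/experimental/gen_ai/moe/activation.py).
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 128, 5120  # one of the parameter sets Hypothesis tried above
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # scale_ub_tensor=None
print(y_fp8.shape, y_scale.shape)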
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

(the remaining Hypothesis examples fail with the identical CompilationError; full tracebacks elided below, only the drawn parameters and the failing entry point differ)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> fails at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> fails at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row
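Every failure above has the same root cause: the Triton kernels materialize the fp8e4nv (torch.float8_e4m3fn) dtype, which Triton supports only on compute capability 8.9 and newer (Ada/Hopper). This job runs on a linux.g5.4xlarge runner, whose NVIDIA A10G is capability 8.6, so only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal skip-guard sketch, assuming pytest and a CUDA build of torch; the requires_fp8 marker name is illustrative, not part of the test suite:

import pytest
import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (e.g. L4, L40S, H100)
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical marker one could apply to test_silu_mul_quant
requires_fp8 = pytest.mark.skipif(
    not _supports_fp8e4nv(),
    reason="fp8e4nv requires compute capability >= 8.9; this GPU only "
    "supports fp8e4b15/fp8e5 in Triton",
)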
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fails at moe/activation_test.py:117 in fn (torch.compile path), compiling _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fails at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row
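For context, the ref_fn path in the test computes y = SiLU(x0) * x1 and then quantizes y row-wise to fp8. A minimal eager sketch of that math, assuming torch.float8_e4m3fn is available; silu_mul_quant_ref is an illustrative name, not FBGEMM's API, and the real triton_quantize_fp8_row kernel may differ in details such as epsilon handling:

import torch

def silu_mul_quant_ref(x0, x1, scale_ub=None, eps=1e-12):
    # SiLU(x0) * x1 in fp32, as in the test's ref_fn
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)                      # per-row absolute max
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=float(scale_ub))
    fp8_max = torch.finfo(torch.float8_e4m3fn).max     # 448.0 for e4m3fn
    y_scale = row_max / fp8_max                        # per-row dequant scale
    y_fp8 = (y / y_scale.clamp(min=eps)[:, None]).to(torch.float8_e4m3fn)
    # dequantize as the test does: y_fp8.to(torch.float32) * y_scale[:, None]
    return y_fp8, y_scale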
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> fails at moe/activation_test.py:117 in fn (eager path), compiling _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fails at moe/activation_test.py:117 in fn (torch.compile path), compiling _fbgemm_silu_mul_quant
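The failure is reproducible without FBGEMM at all: any Triton kernel that touches fp8e4nv trips the same architecture check at compile time. A standalone repro sketch, assuming triton and a CUDA build of torch (the kernel name _cast_to_fp8e4nv is ours, not from either library); on this A10G it should raise the same CompilationError:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    # the cast below is what requires SM 8.9+; on SM 8.6 Triton raises
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

x = torch.randn(128, device="cuda")
y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=128)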
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> fails at moe/activation_test.py:117 in fn (eager path), compiling _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> fails at moe/activation_test.py:117 in fn (eager path), compiling _fbgemm_silu_mul_quant
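Short of skipping the tests, a caller can also degrade gracefully by picking an fp8 dtype the device actually supports. A hypothetical helper, not FBGEMM's actual behavior (the library may intentionally require e4m3); torch.float8_e5m2 corresponds to Triton's fp8e5, which the error message lists as supported on this GPU:

import torch

def pick_fp8_dtype() -> torch.dtype:
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn  # Triton fp8e4nv: more mantissa bits
    return torch.float8_e5m2        # Triton fp8e5: available on SM 8.0/8.6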
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> fails at moe/activation_test.py:117 in fn (eager path), compiling _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fails at moe/activation_test.py:117 in fn (torch.compile path), compiling _fbgemm_silu_mul_quant:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

(identical test body, source listing, and traceback repeated for each Hypothesis example below; only the drawn parameters and the failing kernel differ)

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
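Every example fails for the same reason: Triton cannot lower the fp8e4nv (float8 e4m3) dtype on this GPU. Triton accepts fp8e4nv only on CUDA compute capability 8.9 and newer (Ada/Hopper); on older architectures, such as SM 8.6 Ampere parts, it raises exactly this ValueError at kernel-compile time and offers only fp8e4b15 and fp8e5. A minimal capability probe, as a sketch (the helper name supports_fp8e4nv is made up here; torch.cuda.get_device_capability is the standard PyTorch API):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) Triton kernels need SM 8.9+ on NVIDIA GPUs.
        # An SM 8.6 device makes this return False, matching the failures above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)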
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

For this example the log shows fn() completing and the failure moving to the reference path: y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126) calls triton_quantize_fp8_row (moe/activation_test.py:124), which reaches

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](

and fails while the Triton autotuner benchmarks its configs (triton/runtime/autotuner.py:186 -> triton/testing.py:117 -> triton/runtime/jit.py:623):

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
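Both the kernel under test (_fbgemm_silu_mul_quant) and the reference quantizer (_kernel_quantize_fp8_row) emit fp8e4nv, so skipping has to happen above the whole test body, not just around silu_mul_quant. One way to express that, as a sketch only (the class name ActivationTest is invented; the log does not show the real TestCase name, and this is not necessarily how the repository fixed it):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Same probe as above: fp8e4nv needs SM 8.9+ on NVIDIA GPUs.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not supports_fp8e4nv(),
        "fp8e4nv unsupported on this GPU; Triton offers only fp8e4b15/fp8e5",
    )
    class ActivationTest(unittest.TestCase):  # hypothetical name
        ...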
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these fails at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant; the compiled=True runs route through torch/_dynamo/eval_frame.py:678 first, the compiled=False run calls activation.py:80 directly:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
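The failure is not specific to FBGEMM: any Triton kernel that casts to tl.float8e4nv on a pre-SM-8.9 GPU hits the same ValueError when the kernel is compiled. A minimal repro sketch (kernel name, tensor names, and sizes are all made up for illustration):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM < 8.9 this cast is what raises
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)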
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4996720Z 2025-05-07T20:33:50.4997150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.4997155Z 2025-05-07T20:33:50.4997256Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4997479Z self=, 2025-05-07T20:33:50.4997556Z T=1, 2025-05-07T20:33:50.4997637Z D=5120, 2025-05-07T20:33:50.4997717Z scale_ub=None, 2025-05-07T20:33:50.4997801Z contiguous=False, 2025-05-07T20:33:50.4997886Z compiled=False, 2025-05-07T20:33:50.4997959Z ) 2025-05-07T20:33:50.4998180Z self = 2025-05-07T20:33:50.4998352Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.4998356Z 2025-05-07T20:33:50.4998432Z @given( 2025-05-07T20:33:50.4998554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4998648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4998757Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4998872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4998980Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4999053Z ) 2025-05-07T20:33:50.4999350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4999446Z def test_silu_mul_quant( 2025-05-07T20:33:50.4999518Z self, 2025-05-07T20:33:50.4999598Z T: int, 2025-05-07T20:33:50.4999671Z D: int, 2025-05-07T20:33:50.4999796Z scale_ub: Optional[float], 2025-05-07T20:33:50.4999889Z contiguous: bool, 2025-05-07T20:33:50.4999990Z compiled: bool, 2025-05-07T20:33:50.5000068Z ) -> None: 2025-05-07T20:33:50.5000158Z torch.manual_seed(2025) 2025-05-07T20:33:50.5000228Z 2025-05-07T20:33:50.5000403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5000475Z 2025-05-07T20:33:50.5000561Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5000688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5000776Z x = x_sign * x_clamp 2025-05-07T20:33:50.5000855Z x0 = x[:, :D] 2025-05-07T20:33:50.5000935Z x1 = x[:, D:] 2025-05-07T20:33:50.5001007Z 2025-05-07T20:33:50.5001134Z if contiguous: 2025-05-07T20:33:50.5001223Z x0 = x0.contiguous() 2025-05-07T20:33:50.5001311Z x1 = x1.contiguous() 2025-05-07T20:33:50.5001385Z 2025-05-07T20:33:50.5001473Z if scale_ub is not None: 2025-05-07T20:33:50.5001576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5001752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5001826Z ) 2025-05-07T20:33:50.5001903Z else: 2025-05-07T20:33:50.5001996Z scale_ub_tensor = None 2025-05-07T20:33:50.5002065Z 2025-05-07T20:33:50.5002194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5002284Z op = silu_mul_quant 2025-05-07T20:33:50.5002364Z if compiled: 2025-05-07T20:33:50.5002463Z op = torch.compile(op) 2025-05-07T20:33:50.5002564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5002636Z 2025-05-07T20:33:50.5002726Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5002734Z 2025-05-07T20:33:50.5002832Z moe/activation_test.py:117: 2025-05-07T20:33:50.5002960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5003102Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5003202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5003723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5003821Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5004192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5004419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5004774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5004864Z kernel = self.compile( 2025-05-07T20:33:50.5005265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5005441Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5005571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5005578Z 2025-05-07T20:33:50.5005782Z self = 2025-05-07T20:33:50.5006585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5007097Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7de40>} 2025-05-07T20:33:50.5007927Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5008129Z context = 2025-05-07T20:33:50.5008135Z 2025-05-07T20:33:50.5008298Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5008569Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5008672Z module_map=module_map) 2025-05-07T20:33:50.5008831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5008930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5009005Z E ^ 2025-05-07T20:33:50.5009368Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5009375Z 2025-05-07T20:33:50.5009851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5009856Z 2025-05-07T20:33:50.5009956Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5010206Z self=, 2025-05-07T20:33:50.5010373Z T=4096, 2025-05-07T20:33:50.5010465Z D=7168, 2025-05-07T20:33:50.5010550Z scale_ub=1200.0, 2025-05-07T20:33:50.5010634Z contiguous=False, 2025-05-07T20:33:50.5010716Z compiled=False, 2025-05-07T20:33:50.5010795Z ) 2025-05-07T20:33:50.5011017Z self = 2025-05-07T20:33:50.5011201Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.5011205Z 2025-05-07T20:33:50.5011286Z @given( 2025-05-07T20:33:50.5011404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5011508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5011630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5011745Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5011902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5011979Z ) 2025-05-07T20:33:50.5012233Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5012327Z def test_silu_mul_quant( 2025-05-07T20:33:50.5012405Z self, 2025-05-07T20:33:50.5012484Z T: int, 2025-05-07T20:33:50.5012560Z D: int, 2025-05-07T20:33:50.5012654Z scale_ub: Optional[float], 2025-05-07T20:33:50.5012740Z contiguous: bool, 2025-05-07T20:33:50.5012824Z compiled: bool, 2025-05-07T20:33:50.5012896Z ) -> None: 2025-05-07T20:33:50.5012986Z torch.manual_seed(2025) 2025-05-07T20:33:50.5013060Z 2025-05-07T20:33:50.5013230Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5013308Z 2025-05-07T20:33:50.5013398Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5013518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5013615Z x = x_sign * x_clamp 2025-05-07T20:33:50.5013696Z x0 = x[:, :D] 2025-05-07T20:33:50.5013775Z x1 = x[:, D:] 2025-05-07T20:33:50.5013853Z 2025-05-07T20:33:50.5013934Z if contiguous: 2025-05-07T20:33:50.5014027Z x0 = x0.contiguous() 2025-05-07T20:33:50.5014120Z x1 = x1.contiguous() 2025-05-07T20:33:50.5014192Z 2025-05-07T20:33:50.5014282Z if scale_ub is not None: 2025-05-07T20:33:50.5014385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5014587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5014663Z ) 2025-05-07T20:33:50.5014737Z else: 2025-05-07T20:33:50.5014828Z scale_ub_tensor = None 2025-05-07T20:33:50.5014953Z 2025-05-07T20:33:50.5015079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5015172Z op = silu_mul_quant 2025-05-07T20:33:50.5015258Z if compiled: 2025-05-07T20:33:50.5015353Z op = torch.compile(op) 2025-05-07T20:33:50.5015456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5015531Z 2025-05-07T20:33:50.5015618Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5015622Z 2025-05-07T20:33:50.5015721Z moe/activation_test.py:117: 2025-05-07T20:33:50.5015849Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5015946Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5016046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5016567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:50.5016662Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5017081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5017308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5017674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5017805Z kernel = self.compile( 2025-05-07T20:33:50.5018202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5018380Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5018506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5018511Z 2025-05-07T20:33:50.5018714Z self = 2025-05-07T20:33:50.5019526Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5020080Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7f380>} 2025-05-07T20:33:50.5020872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5021063Z context = 2025-05-07T20:33:50.5021067Z 2025-05-07T20:33:50.5021245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5021516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5021624Z module_map=module_map) 2025-05-07T20:33:50.5021794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5021894Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5021970Z E ^ 2025-05-07T20:33:50.5022346Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5022353Z 2025-05-07T20:33:50.5022785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5022789Z 2025-05-07T20:33:50.5022893Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5023116Z self=, 2025-05-07T20:33:50.5023190Z T=16384, 2025-05-07T20:33:50.5023268Z D=7168, 2025-05-07T20:33:50.5023346Z scale_ub=None, 2025-05-07T20:33:50.5023431Z contiguous=True, 2025-05-07T20:33:50.5023511Z compiled=True, 2025-05-07T20:33:50.5023581Z ) 2025-05-07T20:33:50.5023849Z self = 2025-05-07T20:33:50.5024025Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.5024030Z 2025-05-07T20:33:50.5024106Z @given( 2025-05-07T20:33:50.5024230Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5024328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5024439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5024557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5024667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5024738Z ) 2025-05-07T20:33:50.5024988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5025078Z def test_silu_mul_quant( 2025-05-07T20:33:50.5025159Z self, 2025-05-07T20:33:50.5025232Z T: int, 2025-05-07T20:33:50.5025307Z D: int, 2025-05-07T20:33:50.5025603Z scale_ub: Optional[float], 2025-05-07T20:33:50.5025738Z contiguous: bool, 2025-05-07T20:33:50.5025950Z compiled: bool, 2025-05-07T20:33:50.5026034Z ) -> None: 2025-05-07T20:33:50.5026124Z torch.manual_seed(2025) 2025-05-07T20:33:50.5026199Z 2025-05-07T20:33:50.5026378Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5026511Z 2025-05-07T20:33:50.5026599Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5026725Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5026811Z x = x_sign * x_clamp 2025-05-07T20:33:50.5026892Z x0 = x[:, :D] 2025-05-07T20:33:50.5026971Z x1 = x[:, D:] 2025-05-07T20:33:50.5027044Z 2025-05-07T20:33:50.5027129Z if contiguous: 2025-05-07T20:33:50.5027220Z x0 = x0.contiguous() 2025-05-07T20:33:50.5027306Z x1 = x1.contiguous() 2025-05-07T20:33:50.5027379Z 2025-05-07T20:33:50.5027469Z if scale_ub is not None: 2025-05-07T20:33:50.5027573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5027712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5027784Z ) 2025-05-07T20:33:50.5027858Z else: 2025-05-07T20:33:50.5028015Z scale_ub_tensor = None 2025-05-07T20:33:50.5028089Z 2025-05-07T20:33:50.5028220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5028318Z op = silu_mul_quant 2025-05-07T20:33:50.5028404Z if compiled: 2025-05-07T20:33:50.5028509Z op = torch.compile(op) 2025-05-07T20:33:50.5028614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5028689Z 2025-05-07T20:33:50.5028783Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5028788Z 2025-05-07T20:33:50.5028885Z moe/activation_test.py:117: 2025-05-07T20:33:50.5029015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5029118Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5029220Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5029609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5029701Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5030219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5030322Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5030694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5030917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5031277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5031368Z kernel = self.compile( 2025-05-07T20:33:50.5031766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5032008Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5032142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5032146Z 2025-05-07T20:33:50.5032356Z self = 2025-05-07T20:33:50.5033168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5033680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d6b44a0>} 2025-05-07T20:33:50.5034508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5034702Z context = 2025-05-07T20:33:50.5034712Z 2025-05-07T20:33:50.5034880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5035192Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5035300Z module_map=module_map) 2025-05-07T20:33:50.5035460Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5035558Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5035635Z E ^ 2025-05-07T20:33:50.5036002Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1b1d6b51c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
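Note that the check fires inside ast_to_ttir, i.e. at kernel compile time before anything launches, which is why the compiled=True and compiled=False examples fail identically. A self-contained sketch that reproduces the same ValueError on a pre-SM-8.9 GPU; the kernel name and shapes are illustrative, and it assumes a recent Triton and a PyTorch build with float8 dtypes:

    # Sketch: casting to tl.float8e4nv inside any jitted kernel trips the
    # same architecture check at compile time on GPUs without fp8e4nv.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8e4nv[(1,)](x, y, 128, BLOCK=128)  # CompilationError on SM < 8.9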
Hypothesis went on to try the remaining examples; every one failed at the same _fbgemm_silu_mul_quant compile with the identical CompilationError from triton/compiler/compiler.py:100:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5199530Z 2025-05-07T20:33:50.5199967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5199971Z 2025-05-07T20:33:50.5200070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5200330Z self=, 2025-05-07T20:33:50.5200420Z T=2048, 2025-05-07T20:33:50.5200497Z D=5120, 2025-05-07T20:33:50.5200580Z scale_ub=None, 2025-05-07T20:33:50.5200664Z contiguous=False, 2025-05-07T20:33:50.5200743Z compiled=True, 2025-05-07T20:33:50.5200814Z ) 2025-05-07T20:33:50.5201042Z self = 2025-05-07T20:33:50.5201290Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:50.5201294Z 2025-05-07T20:33:50.5201373Z @given( 2025-05-07T20:33:50.5201491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5201589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5201704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5201819Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5201929Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5202008Z ) 2025-05-07T20:33:50.5202255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5202347Z def test_silu_mul_quant( 2025-05-07T20:33:50.5202423Z self, 2025-05-07T20:33:50.5202499Z T: int, 2025-05-07T20:33:50.5202577Z D: int, 2025-05-07T20:33:50.5202671Z scale_ub: Optional[float], 2025-05-07T20:33:50.5202757Z contiguous: bool, 2025-05-07T20:33:50.5202842Z compiled: bool, 2025-05-07T20:33:50.5202961Z ) -> None: 2025-05-07T20:33:50.5203057Z torch.manual_seed(2025) 2025-05-07T20:33:50.5203133Z 2025-05-07T20:33:50.5203305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5203418Z 2025-05-07T20:33:50.5203509Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5203632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5203723Z x = x_sign * x_clamp 2025-05-07T20:33:50.5203806Z x0 = x[:, :D] 2025-05-07T20:33:50.5203888Z x1 = x[:, D:] 2025-05-07T20:33:50.5203963Z 2025-05-07T20:33:50.5204052Z if contiguous: 2025-05-07T20:33:50.5204147Z x0 = x0.contiguous() 2025-05-07T20:33:50.5204245Z x1 = x1.contiguous() 2025-05-07T20:33:50.5204318Z 2025-05-07T20:33:50.5204409Z if scale_ub is not None: 2025-05-07T20:33:50.5204521Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5204664Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5204743Z ) 2025-05-07T20:33:50.5204831Z else: 2025-05-07T20:33:50.5204930Z scale_ub_tensor = None 2025-05-07T20:33:50.5205053Z 2025-05-07T20:33:50.5205190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5205290Z op = silu_mul_quant 2025-05-07T20:33:50.5205381Z if compiled: 2025-05-07T20:33:50.5205489Z op = torch.compile(op) 2025-05-07T20:33:50.5205598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5205679Z 2025-05-07T20:33:50.5205770Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5205774Z 2025-05-07T20:33:50.5205869Z moe/activation_test.py:117: 2025-05-07T20:33:50.5206003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5206103Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5206207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5206593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5206683Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5207202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5207300Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5207669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5207897Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5208250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5208340Z kernel = self.compile( 2025-05-07T20:33:50.5208741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5208964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5209094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5209099Z 2025-05-07T20:33:50.5209307Z self = 2025-05-07T20:33:50.5210168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5210680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d1a19e0>} 2025-05-07T20:33:50.5211462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5211704Z context = 2025-05-07T20:33:50.5211709Z 2025-05-07T20:33:50.5211878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5212152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5212297Z module_map=module_map) 2025-05-07T20:33:50.5212453Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5212552Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5212630Z E ^ 2025-05-07T20:33:50.5212995Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5213000Z 2025-05-07T20:33:50.5213438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5213445Z 2025-05-07T20:33:50.5213548Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5213777Z self=, 2025-05-07T20:33:50.5213852Z T=2048, 2025-05-07T20:33:50.5213966Z D=5120, 2025-05-07T20:33:50.5214058Z scale_ub=1200.0, 2025-05-07T20:33:50.5214149Z contiguous=False, 2025-05-07T20:33:50.5214231Z compiled=True, 2025-05-07T20:33:50.5214311Z ) 2025-05-07T20:33:50.5214615Z self = 2025-05-07T20:33:50.5214792Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:50.5214796Z 2025-05-07T20:33:50.5214879Z @given( 2025-05-07T20:33:50.5214997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5215100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5215211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5215326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5215447Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5215517Z ) 2025-05-07T20:33:50.5215766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5215865Z def test_silu_mul_quant( 2025-05-07T20:33:50.5215943Z self, 2025-05-07T20:33:50.5216020Z T: int, 2025-05-07T20:33:50.5216096Z D: int, 2025-05-07T20:33:50.5216191Z scale_ub: Optional[float], 2025-05-07T20:33:50.5216282Z contiguous: bool, 2025-05-07T20:33:50.5216363Z compiled: bool, 2025-05-07T20:33:50.5216437Z ) -> None: 2025-05-07T20:33:50.5216532Z torch.manual_seed(2025) 2025-05-07T20:33:50.5216601Z 2025-05-07T20:33:50.5216772Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5216844Z 2025-05-07T20:33:50.5216933Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5217059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5217197Z x = x_sign * x_clamp 2025-05-07T20:33:50.5217276Z x0 = x[:, :D] 2025-05-07T20:33:50.5217353Z x1 = x[:, D:] 2025-05-07T20:33:50.5217428Z 2025-05-07T20:33:50.5217508Z if contiguous: 2025-05-07T20:33:50.5217599Z x0 = x0.contiguous() 2025-05-07T20:33:50.5217692Z x1 = x1.contiguous() 2025-05-07T20:33:50.5217762Z 2025-05-07T20:33:50.5217855Z if scale_ub is not None: 2025-05-07T20:33:50.5217958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5218090Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5218167Z ) 2025-05-07T20:33:50.5218243Z else: 2025-05-07T20:33:50.5218337Z scale_ub_tensor = None 2025-05-07T20:33:50.5218409Z 2025-05-07T20:33:50.5218535Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5218620Z op = silu_mul_quant 2025-05-07T20:33:50.5218705Z if compiled: 2025-05-07T20:33:50.5218802Z op = torch.compile(op) 2025-05-07T20:33:50.5218947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5219018Z 2025-05-07T20:33:50.5219105Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5219112Z 2025-05-07T20:33:50.5219208Z moe/activation_test.py:117: 2025-05-07T20:33:50.5219373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5219471Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5219573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5219955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5220048Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5220564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5220662Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5221043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5221267Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5221658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5221757Z kernel = self.compile( 2025-05-07T20:33:50.5222158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5222329Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5222461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5222466Z 2025-05-07T20:33:50.5222670Z self = 2025-05-07T20:33:50.5228322Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5228885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d1a2b60>} 2025-05-07T20:33:50.5229685Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5229885Z context = 2025-05-07T20:33:50.5229890Z 2025-05-07T20:33:50.5230058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5230330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5230554Z module_map=module_map) 2025-05-07T20:33:50.5230723Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5230828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5230906Z E ^ 2025-05-07T20:33:50.5231279Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5231287Z 2025-05-07T20:33:50.5231724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5231728Z 2025-05-07T20:33:50.5231835Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5232073Z self=, 2025-05-07T20:33:50.5232151Z T=4096, 2025-05-07T20:33:50.5232226Z D=5120, 2025-05-07T20:33:50.5232311Z scale_ub=1200.0, 2025-05-07T20:33:50.5232395Z contiguous=True, 2025-05-07T20:33:50.5232475Z compiled=True, 2025-05-07T20:33:50.5232553Z ) 2025-05-07T20:33:50.5232842Z self = 2025-05-07T20:33:50.5233018Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.5233022Z 2025-05-07T20:33:50.5233101Z @given( 2025-05-07T20:33:50.5233222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5233404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5233521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5233645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5233767Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5233846Z ) 2025-05-07T20:33:50.5234105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5234204Z def test_silu_mul_quant( 2025-05-07T20:33:50.5234285Z self, 2025-05-07T20:33:50.5234369Z T: int, 2025-05-07T20:33:50.5234459Z D: int, 2025-05-07T20:33:50.5234563Z scale_ub: Optional[float], 2025-05-07T20:33:50.5234659Z contiguous: bool, 2025-05-07T20:33:50.5234752Z compiled: bool, 2025-05-07T20:33:50.5234828Z ) -> None: 2025-05-07T20:33:50.5234928Z torch.manual_seed(2025) 2025-05-07T20:33:50.5235092Z 2025-05-07T20:33:50.5235266Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5235349Z 2025-05-07T20:33:50.5235443Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5235574Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5235667Z x = x_sign * x_clamp 2025-05-07T20:33:50.5235747Z x0 = x[:, :D] 2025-05-07T20:33:50.5235827Z x1 = x[:, D:] 2025-05-07T20:33:50.5235905Z 2025-05-07T20:33:50.5235993Z if contiguous: 2025-05-07T20:33:50.5236087Z x0 = x0.contiguous() 2025-05-07T20:33:50.5236185Z x1 = x1.contiguous() 2025-05-07T20:33:50.5236261Z 2025-05-07T20:33:50.5236360Z if scale_ub is not None: 2025-05-07T20:33:50.5236470Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5236613Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5236697Z ) 2025-05-07T20:33:50.5236779Z else: 2025-05-07T20:33:50.5236876Z scale_ub_tensor = None 2025-05-07T20:33:50.5236956Z 2025-05-07T20:33:50.5237087Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5237182Z op = silu_mul_quant 2025-05-07T20:33:50.5237279Z if compiled: 2025-05-07T20:33:50.5237386Z op = torch.compile(op) 2025-05-07T20:33:50.5237494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5237578Z 2025-05-07T20:33:50.5237671Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5237676Z 2025-05-07T20:33:50.5237785Z moe/activation_test.py:117: 2025-05-07T20:33:50.5237920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5238072Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5238181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5238566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5238663Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5239189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5239292Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5239668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5239896Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5240251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5240347Z kernel = self.compile( 2025-05-07T20:33:50.5240795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5240977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5241111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5241116Z 2025-05-07T20:33:50.5241369Z self = 2025-05-07T20:33:50.5242181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5242696Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d054180>} 2025-05-07T20:33:50.5244893Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5245090Z context = 2025-05-07T20:33:50.5245094Z 2025-05-07T20:33:50.5245300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5245579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5245687Z module_map=module_map) 2025-05-07T20:33:50.5245850Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5245953Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5246028Z E ^ 2025-05-07T20:33:50.5246406Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5246411Z 2025-05-07T20:33:50.5246843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5246852Z 2025-05-07T20:33:50.5246962Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5247198Z self=, 2025-05-07T20:33:50.5247274Z T=128, 2025-05-07T20:33:50.5247357Z D=5120, 2025-05-07T20:33:50.5247441Z scale_ub=1200.0, 2025-05-07T20:33:50.5247526Z contiguous=False, 2025-05-07T20:33:50.5247610Z compiled=True, 2025-05-07T20:33:50.5247687Z ) 2025-05-07T20:33:50.5247911Z self = 2025-05-07T20:33:50.5248090Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:50.5248094Z 2025-05-07T20:33:50.5248172Z @given( 2025-05-07T20:33:50.5248289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5248393Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5248506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5248690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5248803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5248878Z ) 2025-05-07T20:33:50.5249137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5249232Z def test_silu_mul_quant( 2025-05-07T20:33:50.5249310Z self, 2025-05-07T20:33:50.5249391Z T: int, 2025-05-07T20:33:50.5249479Z D: int, 2025-05-07T20:33:50.5249595Z scale_ub: Optional[float], 2025-05-07T20:33:50.5249701Z contiguous: bool, 2025-05-07T20:33:50.5249797Z compiled: bool, 2025-05-07T20:33:50.5249874Z ) -> None: 2025-05-07T20:33:50.5249975Z torch.manual_seed(2025) 2025-05-07T20:33:50.5250047Z 2025-05-07T20:33:50.5250220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5250296Z 2025-05-07T20:33:50.5250394Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5250568Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5250661Z x = x_sign * x_clamp 2025-05-07T20:33:50.5250743Z x0 = x[:, :D] 2025-05-07T20:33:50.5250828Z x1 = x[:, D:] 2025-05-07T20:33:50.5250905Z 2025-05-07T20:33:50.5250988Z if contiguous: 2025-05-07T20:33:50.5251134Z x0 = x0.contiguous() 2025-05-07T20:33:50.5251226Z x1 = x1.contiguous() 2025-05-07T20:33:50.5251301Z 2025-05-07T20:33:50.5251394Z if scale_ub is not None: 2025-05-07T20:33:50.5251500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5251635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5251721Z ) 2025-05-07T20:33:50.5251798Z else: 2025-05-07T20:33:50.5251898Z scale_ub_tensor = None 2025-05-07T20:33:50.5251974Z 2025-05-07T20:33:50.5252107Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5252213Z op = silu_mul_quant 2025-05-07T20:33:50.5252302Z if compiled: 2025-05-07T20:33:50.5252412Z op = torch.compile(op) 2025-05-07T20:33:50.5252527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5252607Z 2025-05-07T20:33:50.5252745Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5252752Z 2025-05-07T20:33:50.5252856Z moe/activation_test.py:117: 2025-05-07T20:33:50.5252991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5253100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5253202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5253590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5253691Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5254216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5254318Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5254777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5255008Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5255368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5255463Z kernel = self.compile( 2025-05-07T20:33:50.5255864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5256041Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5256173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5256178Z 2025-05-07T20:33:50.5256394Z self = 2025-05-07T20:33:50.5257257Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5257779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d054ea0>} 2025-05-07T20:33:50.5258572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5258766Z context = 2025-05-07T20:33:50.5258771Z 2025-05-07T20:33:50.5258943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5259216Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5259365Z module_map=module_map) 2025-05-07T20:33:50.5259533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5259635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5259720Z E ^ 2025-05-07T20:33:50.5260136Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5260191Z 2025-05-07T20:33:50.5260629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5260634Z 2025-05-07T20:33:50.5260744Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5260970Z self=, 2025-05-07T20:33:50.5261048Z T=16384, 2025-05-07T20:33:50.5261132Z D=7168, 2025-05-07T20:33:50.5261215Z scale_ub=1200.0, 2025-05-07T20:33:50.5261301Z contiguous=True, 2025-05-07T20:33:50.5261387Z compiled=True, 2025-05-07T20:33:50.5261464Z ) 2025-05-07T20:33:50.5261689Z self = 2025-05-07T20:33:50.5261867Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.5261914Z 2025-05-07T20:33:50.5261999Z @given( 2025-05-07T20:33:50.5262129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5262231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5262349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5262471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5262586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5262669Z ) 2025-05-07T20:33:50.5262920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5263019Z def test_silu_mul_quant( 2025-05-07T20:33:50.5263101Z self, 2025-05-07T20:33:50.5263184Z T: int, 2025-05-07T20:33:50.5263263Z D: int, 2025-05-07T20:33:50.5263373Z scale_ub: Optional[float], 2025-05-07T20:33:50.5263465Z contiguous: bool, 2025-05-07T20:33:50.5263552Z compiled: bool, 2025-05-07T20:33:50.5263633Z ) -> None: 2025-05-07T20:33:50.5263731Z torch.manual_seed(2025) 2025-05-07T20:33:50.5263810Z 2025-05-07T20:33:50.5263986Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5264061Z 2025-05-07T20:33:50.5264158Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5264286Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5264383Z x = x_sign * x_clamp 2025-05-07T20:33:50.5264474Z x0 = x[:, :D] 2025-05-07T20:33:50.5264561Z x1 = x[:, D:] 2025-05-07T20:33:50.5264641Z 2025-05-07T20:33:50.5264735Z if contiguous: 2025-05-07T20:33:50.5264832Z x0 = x0.contiguous() 2025-05-07T20:33:50.5264929Z x1 = x1.contiguous() 2025-05-07T20:33:50.5265066Z 2025-05-07T20:33:50.5265162Z if scale_ub is not None: 2025-05-07T20:33:50.5265276Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5265417Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5265498Z ) 2025-05-07T20:33:50.5265576Z else: 2025-05-07T20:33:50.5265679Z scale_ub_tensor = None 2025-05-07T20:33:50.5265757Z 2025-05-07T20:33:50.5265891Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5265986Z op = silu_mul_quant 2025-05-07T20:33:50.5266074Z if compiled: 2025-05-07T20:33:50.5266179Z op = torch.compile(op) 2025-05-07T20:33:50.5266287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5266362Z 2025-05-07T20:33:50.5266467Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5266472Z 2025-05-07T20:33:50.5266572Z moe/activation_test.py:117: 2025-05-07T20:33:50.5266704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5266858Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5266963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5267358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5267538Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5268055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5268155Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5268526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5268753Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5269112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5269207Z kernel = self.compile( 2025-05-07T20:33:50.5269614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5269792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5269961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5269972Z 2025-05-07T20:33:50.5270182Z self = 2025-05-07T20:33:50.5270992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5271510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d0560c0>} 2025-05-07T20:33:50.5272306Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5272509Z context = 2025-05-07T20:33:50.5272514Z 2025-05-07T20:33:50.5272683Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5272953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5273060Z module_map=module_map) 2025-05-07T20:33:50.5273221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5273320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5273404Z E ^ 2025-05-07T20:33:50.5273778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5273826Z 2025-05-07T20:33:50.5274269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5274274Z 2025-05-07T20:33:50.5274380Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5274608Z self=, 2025-05-07T20:33:50.5274693Z T=16384, 2025-05-07T20:33:50.5274772Z D=5120, 2025-05-07T20:33:50.5274855Z scale_ub=1200.0, 2025-05-07T20:33:50.5274948Z contiguous=True, 2025-05-07T20:33:50.5275033Z compiled=False, 2025-05-07T20:33:50.5275104Z ) 2025-05-07T20:33:50.5275334Z self = 2025-05-07T20:33:50.5275516Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5275521Z 2025-05-07T20:33:50.5275606Z @given( 2025-05-07T20:33:50.5275728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5275834Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5275996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5276117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5276240Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5276326Z ) 2025-05-07T20:33:50.5276576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5276715Z def test_silu_mul_quant( 2025-05-07T20:33:50.5276790Z self, 2025-05-07T20:33:50.5276866Z T: int, 2025-05-07T20:33:50.5276947Z D: int, 2025-05-07T20:33:50.5277046Z scale_ub: Optional[float], 2025-05-07T20:33:50.5277139Z contiguous: bool, 2025-05-07T20:33:50.5277234Z compiled: bool, 2025-05-07T20:33:50.5277312Z ) -> None: 2025-05-07T20:33:50.5277410Z torch.manual_seed(2025) 2025-05-07T20:33:50.5277488Z 2025-05-07T20:33:50.5277661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5277744Z 2025-05-07T20:33:50.5277838Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5277970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5278066Z x = x_sign * x_clamp 2025-05-07T20:33:50.5278188Z x0 = x[:, :D] 2025-05-07T20:33:50.5278269Z x1 = x[:, D:] 2025-05-07T20:33:50.5278362Z 2025-05-07T20:33:50.5278448Z if contiguous: 2025-05-07T20:33:50.5278542Z x0 = x0.contiguous() 2025-05-07T20:33:50.5278642Z x1 = x1.contiguous() 2025-05-07T20:33:50.5278717Z 2025-05-07T20:33:50.5278813Z if scale_ub is not None: 2025-05-07T20:33:50.5278926Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5279063Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5279143Z ) 2025-05-07T20:33:50.5279229Z else: 2025-05-07T20:33:50.5279328Z scale_ub_tensor = None 2025-05-07T20:33:50.5279403Z 2025-05-07T20:33:50.5279537Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5279632Z op = silu_mul_quant 2025-05-07T20:33:50.5279720Z if compiled: 2025-05-07T20:33:50.5279825Z op = torch.compile(op) 2025-05-07T20:33:50.5279934Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5280028Z 2025-05-07T20:33:50.5280143Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5280149Z 2025-05-07T20:33:50.5280257Z moe/activation_test.py:117: 2025-05-07T20:33:50.5280413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5280513Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5280610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5281142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:50.5281241Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5281624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5281898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5282255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5282359Z kernel = self.compile( 2025-05-07T20:33:50.5282759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5282939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5283067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5283071Z 2025-05-07T20:33:50.5283278Z self = 2025-05-07T20:33:50.5284126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5284649Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d055a80>} 2025-05-07T20:33:50.5285489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5285683Z context = 2025-05-07T20:33:50.5285688Z 2025-05-07T20:33:50.5285857Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5286137Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5286246Z module_map=module_map) 2025-05-07T20:33:50.5286420Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5286519Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5286595Z E ^ 2025-05-07T20:33:50.5287012Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5287019Z 2025-05-07T20:33:50.5287453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5287457Z 2025-05-07T20:33:50.5287563Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5287794Z self=, 2025-05-07T20:33:50.5287877Z T=1, 2025-05-07T20:33:50.5287960Z D=7168, 2025-05-07T20:33:50.5288045Z scale_ub=1200.0, 2025-05-07T20:33:50.5288134Z contiguous=False, 2025-05-07T20:33:50.5288232Z compiled=False, 2025-05-07T20:33:50.5288311Z ) 2025-05-07T20:33:50.5288541Z self = 2025-05-07T20:33:50.5288723Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.5288727Z 2025-05-07T20:33:50.5288814Z @given( 2025-05-07T20:33:50.5288943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5289044Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5289167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5289288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5289404Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5289477Z ) 2025-05-07T20:33:50.5289759Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5289871Z def test_silu_mul_quant( 2025-05-07T20:33:50.5289958Z self, 2025-05-07T20:33:50.5290041Z T: int, 2025-05-07T20:33:50.5290123Z D: int, 2025-05-07T20:33:50.5290221Z scale_ub: Optional[float], 2025-05-07T20:33:50.5290376Z contiguous: bool, 2025-05-07T20:33:50.5290466Z compiled: bool, 2025-05-07T20:33:50.5290549Z ) -> None: 2025-05-07T20:33:50.5290644Z torch.manual_seed(2025) 2025-05-07T20:33:50.5290725Z 2025-05-07T20:33:50.5290897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5290972Z 2025-05-07T20:33:50.5291071Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5291194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5291286Z x = x_sign * x_clamp 2025-05-07T20:33:50.5291376Z x0 = x[:, :D] 2025-05-07T20:33:50.5291457Z x1 = x[:, D:] 2025-05-07T20:33:50.5291532Z 2025-05-07T20:33:50.5291627Z if contiguous: 2025-05-07T20:33:50.5291719Z x0 = x0.contiguous() 2025-05-07T20:33:50.5291814Z x1 = x1.contiguous() 2025-05-07T20:33:50.5291890Z 2025-05-07T20:33:50.5291982Z if scale_ub is not None: 2025-05-07T20:33:50.5292097Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5292279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5292359Z ) 2025-05-07T20:33:50.5292444Z else: 2025-05-07T20:33:50.5292543Z scale_ub_tensor = None 2025-05-07T20:33:50.5292622Z 2025-05-07T20:33:50.5292755Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5292891Z op = silu_mul_quant 2025-05-07T20:33:50.5292979Z if compiled: 2025-05-07T20:33:50.5293086Z op = torch.compile(op) 2025-05-07T20:33:50.5293193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5293274Z 2025-05-07T20:33:50.5293367Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5293371Z 2025-05-07T20:33:50.5293471Z moe/activation_test.py:117: 2025-05-07T20:33:50.5293606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5293710Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5293819Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5294354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5294578Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5294957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5295193Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5295551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5295650Z kernel = self.compile( 2025-05-07T20:33:50.5296050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5296227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5296362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5296370Z 2025-05-07T20:33:50.5296582Z self = 2025-05-07T20:33:50.5297397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5297914Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cdfc0e0>} 2025-05-07T20:33:50.5298707Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5298901Z context = 2025-05-07T20:33:50.5298952Z 2025-05-07T20:33:50.5299125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5299406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5299524Z module_map=module_map) 2025-05-07T20:33:50.5299692Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5299800Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5299880Z E ^ 2025-05-07T20:33:50.5300253Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5300257Z 2025-05-07T20:33:50.5300688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5300692Z 2025-05-07T20:33:50.5300794Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5301026Z self=, 2025-05-07T20:33:50.5301105Z T=4096, 2025-05-07T20:33:50.5301255Z D=7168, 2025-05-07T20:33:50.5301341Z scale_ub=1200.0, 2025-05-07T20:33:50.5301431Z contiguous=False, 2025-05-07T20:33:50.5301519Z compiled=True, 2025-05-07T20:33:50.5301596Z ) 2025-05-07T20:33:50.5301824Z self = 2025-05-07T20:33:50.5302050Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:50.5302054Z 2025-05-07T20:33:50.5302134Z @given( 2025-05-07T20:33:50.5302252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5302358Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5302478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5302606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5302723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5302803Z ) 2025-05-07T20:33:50.5303069Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5303171Z def test_silu_mul_quant( 2025-05-07T20:33:50.5303254Z self, 2025-05-07T20:33:50.5303337Z T: int, 2025-05-07T20:33:50.5303415Z D: int, 2025-05-07T20:33:50.5303558Z scale_ub: Optional[float], 2025-05-07T20:33:50.5303653Z contiguous: bool, 2025-05-07T20:33:50.5303738Z compiled: bool, 2025-05-07T20:33:50.5303816Z ) -> None: 2025-05-07T20:33:50.5303918Z torch.manual_seed(2025) 2025-05-07T20:33:50.5303989Z 2025-05-07T20:33:50.5304159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5304235Z 2025-05-07T20:33:50.5304327Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5304452Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5304540Z x = x_sign * x_clamp 2025-05-07T20:33:50.5304620Z x0 = x[:, :D] 2025-05-07T20:33:50.5304705Z x1 = x[:, D:] 2025-05-07T20:33:50.5304779Z 2025-05-07T20:33:50.5304861Z if contiguous: 2025-05-07T20:33:50.5304967Z x0 = x0.contiguous() 2025-05-07T20:33:50.5305058Z x1 = x1.contiguous() 2025-05-07T20:33:50.5305132Z 2025-05-07T20:33:50.5305234Z if scale_ub is not None: 2025-05-07T20:33:50.5305344Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5305482Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5305567Z ) 2025-05-07T20:33:50.5305647Z else: 2025-05-07T20:33:50.5305747Z scale_ub_tensor = None 2025-05-07T20:33:50.5305822Z 2025-05-07T20:33:50.5305952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5306048Z op = silu_mul_quant 2025-05-07T20:33:50.5306131Z if compiled: 2025-05-07T20:33:50.5306235Z op = torch.compile(op) 2025-05-07T20:33:50.5306344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5306468Z 2025-05-07T20:33:50.5306560Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5306567Z 2025-05-07T20:33:50.5306672Z moe/activation_test.py:117: 2025-05-07T20:33:50.5306804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5306915Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5307022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5307408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5307506Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5308020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5308116Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5308494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5308721Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5309124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5309220Z kernel = self.compile( 2025-05-07T20:33:50.5309620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5309842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5309972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5309976Z 2025-05-07T20:33:50.5310184Z self = 2025-05-07T20:33:50.5311002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5311529Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cdfd300>} 2025-05-07T20:33:50.5312367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5312564Z context = 2025-05-07T20:33:50.5312569Z 2025-05-07T20:33:50.5312742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5313012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5313124Z module_map=module_map) 2025-05-07T20:33:50.5313289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5313389Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5313473Z E ^ 2025-05-07T20:33:50.5313854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5313859Z 2025-05-07T20:33:50.5314296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5314304Z 2025-05-07T20:33:50.5314425Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5314650Z self=, 2025-05-07T20:33:50.5314728Z T=128, 2025-05-07T20:33:50.5314817Z D=7168, 2025-05-07T20:33:50.5314901Z scale_ub=1200.0, 2025-05-07T20:33:50.5314989Z contiguous=False, 2025-05-07T20:33:50.5315083Z compiled=True, 2025-05-07T20:33:50.5315160Z ) 2025-05-07T20:33:50.5315383Z self = 2025-05-07T20:33:50.5315561Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:50.5315618Z 2025-05-07T20:33:50.5315705Z @given( 2025-05-07T20:33:50.5315839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5315943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5316066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5316193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5316309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5316395Z ) 2025-05-07T20:33:50.5316650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5316746Z def test_silu_mul_quant( 2025-05-07T20:33:50.5316832Z self, 2025-05-07T20:33:50.5316916Z T: int, 2025-05-07T20:33:50.5316992Z D: int, 2025-05-07T20:33:50.5317092Z scale_ub: Optional[float], 2025-05-07T20:33:50.5317181Z contiguous: bool, 2025-05-07T20:33:50.5317264Z compiled: bool, 2025-05-07T20:33:50.5317354Z ) -> None: 2025-05-07T20:33:50.5317449Z torch.manual_seed(2025) 2025-05-07T20:33:50.5317571Z 2025-05-07T20:33:50.5317749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5317824Z 2025-05-07T20:33:50.5317917Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5318044Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5318178Z x = x_sign * x_clamp 2025-05-07T20:33:50.5318268Z x0 = x[:, :D] 2025-05-07T20:33:50.5318353Z x1 = x[:, D:] 2025-05-07T20:33:50.5318430Z 2025-05-07T20:33:50.5318519Z if contiguous: 2025-05-07T20:33:50.5318614Z x0 = x0.contiguous() 2025-05-07T20:33:50.5318705Z x1 = x1.contiguous() 2025-05-07T20:33:50.5318786Z 2025-05-07T20:33:50.5318880Z if scale_ub is not None: 2025-05-07T20:33:50.5318989Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5319129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5319211Z ) 2025-05-07T20:33:50.5319292Z else: 2025-05-07T20:33:50.5319395Z scale_ub_tensor = None 2025-05-07T20:33:50.5319472Z 2025-05-07T20:33:50.5319605Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5319740Z op = silu_mul_quant 2025-05-07T20:33:50.5319834Z if compiled: 2025-05-07T20:33:50.5319940Z op = torch.compile(op) 2025-05-07T20:33:50.5320048Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5320124Z 2025-05-07T20:33:50.5320223Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5320228Z 2025-05-07T20:33:50.5320329Z moe/activation_test.py:117: 2025-05-07T20:33:50.5320464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5320574Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5320678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5321070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5321171Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5321690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5321796Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5322172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5322400Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5322759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5322851Z kernel = self.compile( 2025-05-07T20:33:50.5323254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5323429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5323610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5323614Z 2025-05-07T20:33:50.5323826Z self = 2025-05-07T20:33:50.5324633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5325152Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cdfe020>} 2025-05-07T20:33:50.5326186Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5326389Z context = 2025-05-07T20:33:50.5326397Z 2025-05-07T20:33:50.5326654Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5326931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5327041Z module_map=module_map) 2025-05-07T20:33:50.5327263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5327361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5327450Z E ^ 2025-05-07T20:33:50.5327816Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5327821Z 2025-05-07T20:33:50.5328252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5328257Z 2025-05-07T20:33:50.5328356Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5328585Z self=, 2025-05-07T20:33:50.5328664Z T=2048, 2025-05-07T20:33:50.5328740Z D=7168, 2025-05-07T20:33:50.5328819Z scale_ub=None, 2025-05-07T20:33:50.5328911Z contiguous=True, 2025-05-07T20:33:50.5329052Z compiled=True, 2025-05-07T20:33:50.5329132Z ) 2025-05-07T20:33:50.5329353Z self = 2025-05-07T20:33:50.5329522Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.5329526Z 2025-05-07T20:33:50.5329603Z @given( 2025-05-07T20:33:50.5329717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5329813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5329925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5330038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5330147Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5330225Z ) 2025-05-07T20:33:50.5330478Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5330571Z def test_silu_mul_quant( 2025-05-07T20:33:50.5330647Z self, 2025-05-07T20:33:50.5330728Z T: int, 2025-05-07T20:33:50.5330810Z D: int, 2025-05-07T20:33:50.5330909Z scale_ub: Optional[float], 2025-05-07T20:33:50.5330998Z contiguous: bool, 2025-05-07T20:33:50.5331082Z compiled: bool, 2025-05-07T20:33:50.5331161Z ) -> None: 2025-05-07T20:33:50.5331254Z torch.manual_seed(2025) 2025-05-07T20:33:50.5331330Z 2025-05-07T20:33:50.5331496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5331569Z 2025-05-07T20:33:50.5331663Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5331783Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5331867Z x = x_sign * x_clamp 2025-05-07T20:33:50.5332035Z x0 = x[:, :D] 2025-05-07T20:33:50.5332113Z x1 = x[:, D:] 2025-05-07T20:33:50.5332192Z 2025-05-07T20:33:50.5332277Z if contiguous: 2025-05-07T20:33:50.5332372Z x0 = x0.contiguous() 2025-05-07T20:33:50.5332463Z x1 = x1.contiguous() 2025-05-07T20:33:50.5332541Z 2025-05-07T20:33:50.5332633Z if scale_ub is not None: 2025-05-07T20:33:50.5332745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5332879Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5332956Z ) 2025-05-07T20:33:50.5333035Z else: 2025-05-07T20:33:50.5333129Z scale_ub_tensor = None 2025-05-07T20:33:50.5333201Z 2025-05-07T20:33:50.5333331Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5333416Z op = silu_mul_quant 2025-05-07T20:33:50.5333504Z if compiled: 2025-05-07T20:33:50.5333599Z op = torch.compile(op) 2025-05-07T20:33:50.5333701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5333777Z 2025-05-07T20:33:50.5333911Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5333916Z 2025-05-07T20:33:50.5334008Z moe/activation_test.py:117: 2025-05-07T20:33:50.5334146Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5334283Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5334387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5334824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5334918Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5335438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5335533Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5335902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5336136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5336490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5336626Z kernel = self.compile( 2025-05-07T20:33:50.5337030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5337205Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5337343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5337347Z 2025-05-07T20:33:50.5337554Z self = 2025-05-07T20:33:50.5338359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5338881Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cdff240>} 2025-05-07T20:33:50.5339670Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5339867Z context = 2025-05-07T20:33:50.5339872Z 2025-05-07T20:33:50.5340040Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5340314Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5340419Z module_map=module_map) 2025-05-07T20:33:50.5340583Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5340733Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5340813Z E ^ 2025-05-07T20:33:50.5341180Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5341187Z 2025-05-07T20:33:50.5341623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5341630Z 2025-05-07T20:33:50.5341728Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5341955Z self=, 2025-05-07T20:33:50.5342032Z T=16384, 2025-05-07T20:33:50.5342108Z D=5120, 2025-05-07T20:33:50.5342189Z scale_ub=None, 2025-05-07T20:33:50.5342272Z contiguous=False, 2025-05-07T20:33:50.5342352Z compiled=False, 2025-05-07T20:33:50.5342429Z ) 2025-05-07T20:33:50.5342649Z self = 2025-05-07T20:33:50.5342834Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.5342881Z 2025-05-07T20:33:50.5342962Z @given( 2025-05-07T20:33:50.5343082Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5343190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5343307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5343465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5343584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5343656Z ) 2025-05-07T20:33:50.5343906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5344003Z def test_silu_mul_quant( 2025-05-07T20:33:50.5344081Z self, 2025-05-07T20:33:50.5344160Z T: int, 2025-05-07T20:33:50.5344237Z D: int, 2025-05-07T20:33:50.5344337Z scale_ub: Optional[float], 2025-05-07T20:33:50.5344430Z contiguous: bool, 2025-05-07T20:33:50.5344518Z compiled: bool, 2025-05-07T20:33:50.5344591Z ) -> None: 2025-05-07T20:33:50.5344690Z torch.manual_seed(2025) 2025-05-07T20:33:50.5344763Z 2025-05-07T20:33:50.5344932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5345047Z 2025-05-07T20:33:50.5345137Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5345263Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5347182Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
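The CompilationError traces above all fail in the same place: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype that _fbgemm_silu_mul_quant quantizes into. This job runs on linux.g5.4xlarge.nvidia.gpu (NVIDIA A10G, compute capability 8.6), and Triton only exposes fp8e4nv on newer architectures (roughly sm_89/Ada and later, depending on the Triton version), which is why the error lists only ('fp8e4b15', 'fp8e5') as supported. Below is a minimal sketch of a capability guard a test like this could use to skip FP8 E4M3 cases on older GPUs; the supports_fp8e4nv helper and the (8, 9) threshold are illustrative assumptions, not part of the FBGEMM test harness:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (E4M3) lowering in Triton is only available on
        # sufficiently new GPUs; the A10G is sm_86, so this returns
        # False on this runner.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    class ActivationFp8Tests(unittest.TestCase):
        ...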
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5347191Z 2025-05-07T20:33:50.5347310Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.5347317Z 2025-05-07T20:33:50.5347418Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5347647Z self=, 2025-05-07T20:33:50.5347726Z T=4096, 2025-05-07T20:33:50.5347806Z D=7168, 2025-05-07T20:33:50.5347889Z scale_ub=1200.0, 2025-05-07T20:33:50.5347977Z contiguous=True, 2025-05-07T20:33:50.5348058Z compiled=True, 2025-05-07T20:33:50.5348127Z ) 2025-05-07T20:33:50.5348352Z self = 2025-05-07T20:33:50.5348524Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.5348529Z 2025-05-07T20:33:50.5348606Z @given( 2025-05-07T20:33:50.5348725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5348822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5348982Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5349094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5349203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5349279Z ) 2025-05-07T20:33:50.5349527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5349620Z def test_silu_mul_quant( 2025-05-07T20:33:50.5349695Z self, 2025-05-07T20:33:50.5349770Z T: int, 2025-05-07T20:33:50.5349846Z D: int, 2025-05-07T20:33:50.5349948Z scale_ub: Optional[float], 2025-05-07T20:33:50.5350033Z contiguous: bool, 2025-05-07T20:33:50.5350117Z compiled: bool, 2025-05-07T20:33:50.5350193Z ) -> None: 2025-05-07T20:33:50.5350283Z torch.manual_seed(2025) 2025-05-07T20:33:50.5350357Z 2025-05-07T20:33:50.5350522Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5350595Z 2025-05-07T20:33:50.5350687Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5350850Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5352763Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
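The "Tried to allocate" sizes in these OutOfMemoryError reports match the bfloat16 input the test builds: x has shape [T, 2 * D] at 2 bytes per element, so T=4096, D=7168 needs 4096 * 14336 * 2 bytes = 112 MiB and T=16384, D=7168 needs 448 MiB, exactly the amounts reported. A quick check of that arithmetic:

    # Size in MiB of x = torch.randn([T, 2 * D], dtype=torch.bfloat16).
    def x_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20  # 2 bytes per bfloat16 element

    print(x_mib(4096, 7168))   # 112.0
    print(x_mib(16384, 7168))  # 448.0
    print(x_mib(16384, 5120))  # 320.0

Note that x_sign and x_clamp each allocate another tensor of the same size, so a single example at T=16384, D=7168 needs several such 448 MiB buffers before the kernel even launches.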
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5352814Z 2025-05-07T20:33:50.5352930Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.5352935Z 2025-05-07T20:33:50.5353033Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5353261Z self=, 2025-05-07T20:33:50.5353342Z T=16384, 2025-05-07T20:33:50.5353425Z D=7168, 2025-05-07T20:33:50.5353507Z scale_ub=None, 2025-05-07T20:33:50.5353591Z contiguous=False, 2025-05-07T20:33:50.5353715Z compiled=False, 2025-05-07T20:33:50.5353793Z ) 2025-05-07T20:33:50.5354011Z self = 2025-05-07T20:33:50.5354194Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.5354198Z 2025-05-07T20:33:50.5355659Z @given( 2025-05-07T20:33:50.5355777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5355877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5360560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5360689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5360805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5360883Z ) 2025-05-07T20:33:50.5361138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5361234Z def test_silu_mul_quant( 2025-05-07T20:33:50.5361308Z self, 2025-05-07T20:33:50.5361383Z T: int, 2025-05-07T20:33:50.5361461Z D: int, 2025-05-07T20:33:50.5361556Z scale_ub: Optional[float], 2025-05-07T20:33:50.5361646Z contiguous: bool, 2025-05-07T20:33:50.5361730Z compiled: bool, 2025-05-07T20:33:50.5361806Z ) -> None: 2025-05-07T20:33:50.5361899Z torch.manual_seed(2025) 2025-05-07T20:33:50.5361970Z 2025-05-07T20:33:50.5362140Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5364073Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5364146Z 2025-05-07T20:33:50.5364265Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5364269Z 2025-05-07T20:33:50.5364371Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5364596Z self=, 2025-05-07T20:33:50.5364673Z T=2048, 2025-05-07T20:33:50.5364751Z D=7168, 2025-05-07T20:33:50.5364831Z scale_ub=1200.0, 2025-05-07T20:33:50.5364910Z contiguous=True, 2025-05-07T20:33:50.5364993Z compiled=True, 2025-05-07T20:33:50.5365067Z ) 2025-05-07T20:33:50.5365291Z self = 2025-05-07T20:33:50.5365468Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.5365474Z 2025-05-07T20:33:50.5365594Z @given( 2025-05-07T20:33:50.5365712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5365810Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5365923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5366087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5366197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5366267Z ) 2025-05-07T20:33:50.5366524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5366616Z def test_silu_mul_quant( 2025-05-07T20:33:50.5366691Z self, 2025-05-07T20:33:50.5366770Z T: int, 2025-05-07T20:33:50.5366848Z D: int, 2025-05-07T20:33:50.5366943Z scale_ub: Optional[float], 2025-05-07T20:33:50.5367032Z contiguous: bool, 2025-05-07T20:33:50.5367120Z compiled: bool, 2025-05-07T20:33:50.5367203Z ) -> None: 2025-05-07T20:33:50.5367295Z torch.manual_seed(2025) 2025-05-07T20:33:50.5367366Z 2025-05-07T20:33:50.5367537Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5367650Z 2025-05-07T20:33:50.5367743Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5367871Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5369776Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
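Each "Trying example: test_silu_mul_quant(...)" block is Hypothesis re-running the same test body with newly drawn parameters; the @settings(verbosity=Verbosity.verbose, ...) decorator is what prints every attempt. Because all examples run inside a single test invocation, blocks cached by the CUDA allocator during earlier failed examples are still reserved when the next example starts, which is consistent with the OOMs appearing only after many attempts. One possible mitigation, shown here as a sketch rather than anything activation_test.py currently does, is to release cached blocks at the top of the test body:

    import gc

    import torch

    def reset_cuda_between_examples() -> None:
        # Drop dead Python references first, then return cached but
        # unused blocks from PyTorch's caching allocator to the driver.
        gc.collect()
        torch.cuda.empty_cache()

    # Intended to be called as the first statement of test_silu_mul_quant,
    # so each Hypothesis example starts from a drained cache.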
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5369785Z 2025-05-07T20:33:50.5369913Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.5369917Z 2025-05-07T20:33:50.5370017Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5370253Z self=, 2025-05-07T20:33:50.5370328Z T=2048, 2025-05-07T20:33:50.5370407Z D=7168, 2025-05-07T20:33:50.5370496Z scale_ub=None, 2025-05-07T20:33:50.5370581Z contiguous=True, 2025-05-07T20:33:50.5370664Z compiled=False, 2025-05-07T20:33:50.5370740Z ) 2025-05-07T20:33:50.5370961Z self = 2025-05-07T20:33:50.5371137Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5371142Z 2025-05-07T20:33:50.5371225Z @given( 2025-05-07T20:33:50.5371339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5371435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5371595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5371716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5371829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5371898Z ) 2025-05-07T20:33:50.5372148Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5372245Z def test_silu_mul_quant( 2025-05-07T20:33:50.5372318Z self, 2025-05-07T20:33:50.5372391Z T: int, 2025-05-07T20:33:50.5372469Z D: int, 2025-05-07T20:33:50.5372564Z scale_ub: Optional[float], 2025-05-07T20:33:50.5372651Z contiguous: bool, 2025-05-07T20:33:50.5372737Z compiled: bool, 2025-05-07T20:33:50.5372810Z ) -> None: 2025-05-07T20:33:50.5372903Z torch.manual_seed(2025) 2025-05-07T20:33:50.5372975Z 2025-05-07T20:33:50.5373142Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5373224Z 2025-05-07T20:33:50.5373314Z > x_sign = torch.sign(x) 2025-05-07T20:33:50.5375345Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5375394Z 2025-05-07T20:33:50.5375512Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:50.5375516Z 2025-05-07T20:33:50.5375616Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5375843Z self=, 2025-05-07T20:33:50.5375916Z T=1, 2025-05-07T20:33:50.5375997Z D=7168, 2025-05-07T20:33:50.5376084Z scale_ub=1200.0, 2025-05-07T20:33:50.5376169Z contiguous=True, 2025-05-07T20:33:50.5376251Z compiled=False, 2025-05-07T20:33:50.5376332Z ) 2025-05-07T20:33:50.5376589Z self = 2025-05-07T20:33:50.5376763Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5376771Z 2025-05-07T20:33:50.5376849Z @given( 2025-05-07T20:33:50.5376963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5377061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5377177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5377289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5377401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5377474Z ) 2025-05-07T20:33:50.5377722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5377823Z def test_silu_mul_quant( 2025-05-07T20:33:50.5377898Z self, 2025-05-07T20:33:50.5377977Z T: int, 2025-05-07T20:33:50.5378053Z D: int, 2025-05-07T20:33:50.5378148Z scale_ub: Optional[float], 2025-05-07T20:33:50.5378241Z contiguous: bool, 2025-05-07T20:33:50.5378322Z compiled: bool, 2025-05-07T20:33:50.5378401Z ) -> None: 2025-05-07T20:33:50.5378496Z torch.manual_seed(2025) 2025-05-07T20:33:50.5378567Z 2025-05-07T20:33:50.5378734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5378811Z 2025-05-07T20:33:50.5378903Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5379024Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5379110Z x = x_sign * x_clamp 2025-05-07T20:33:50.5379185Z x0 = x[:, :D] 2025-05-07T20:33:50.5379267Z x1 = x[:, D:] 2025-05-07T20:33:50.5379339Z 2025-05-07T20:33:50.5379420Z if contiguous: 2025-05-07T20:33:50.5379555Z x0 = x0.contiguous() 2025-05-07T20:33:50.5379645Z x1 = x1.contiguous() 2025-05-07T20:33:50.5379713Z 2025-05-07T20:33:50.5379804Z if scale_ub is not None: 2025-05-07T20:33:50.5379910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5380046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5380122Z ) 2025-05-07T20:33:50.5380198Z else: 2025-05-07T20:33:50.5380287Z scale_ub_tensor = None 2025-05-07T20:33:50.5380362Z 2025-05-07T20:33:50.5380490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5380582Z op = silu_mul_quant 2025-05-07T20:33:50.5380661Z if compiled: 2025-05-07T20:33:50.5380759Z op = torch.compile(op) 2025-05-07T20:33:50.5380863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5380936Z 2025-05-07T20:33:50.5381023Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5381030Z 2025-05-07T20:33:50.5381128Z moe/activation_test.py:117: 2025-05-07T20:33:50.5381299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5381402Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5381508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5382032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5382174Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5382550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5382781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5383136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5383226Z kernel = self.compile( 2025-05-07T20:33:50.5383633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5383809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5383978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5383983Z 2025-05-07T20:33:50.5384200Z self = 2025-05-07T20:33:50.5385012Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5385530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dd96520>} 2025-05-07T20:33:50.5386323Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5386521Z context = 2025-05-07T20:33:50.5386526Z 2025-05-07T20:33:50.5386702Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5386973Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5387081Z module_map=module_map) 2025-05-07T20:33:50.5387241Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5387337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5387418Z E ^ 2025-05-07T20:33:50.5387784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5387789Z 2025-05-07T20:33:50.5388219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5388294Z 2025-05-07T20:33:50.5388394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5388619Z self=, 2025-05-07T20:33:50.5388702Z T=128, 2025-05-07T20:33:50.5388779Z D=5120, 2025-05-07T20:33:50.5388861Z scale_ub=None, 2025-05-07T20:33:50.5388947Z contiguous=True, 2025-05-07T20:33:50.5389027Z compiled=False, 2025-05-07T20:33:50.5389102Z ) 2025-05-07T20:33:50.5389328Z self = 2025-05-07T20:33:50.5389497Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5389502Z 2025-05-07T20:33:50.5389581Z @given( 2025-05-07T20:33:50.5389702Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5389798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5389912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5390072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5390185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5390257Z ) 2025-05-07T20:33:50.5390509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5390668Z def test_silu_mul_quant( 2025-05-07T20:33:50.5390749Z self, 2025-05-07T20:33:50.5390825Z T: int, 2025-05-07T20:33:50.5390902Z D: int, 2025-05-07T20:33:50.5391000Z scale_ub: Optional[float], 2025-05-07T20:33:50.5391084Z contiguous: bool, 2025-05-07T20:33:50.5391168Z compiled: bool, 2025-05-07T20:33:50.5391243Z ) -> None: 2025-05-07T20:33:50.5391333Z torch.manual_seed(2025) 2025-05-07T20:33:50.5391409Z 2025-05-07T20:33:50.5391575Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5391649Z 2025-05-07T20:33:50.5391739Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5391863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5391952Z x = x_sign * x_clamp 2025-05-07T20:33:50.5392034Z x0 = x[:, :D] 2025-05-07T20:33:50.5392111Z x1 = x[:, D:] 2025-05-07T20:33:50.5392223Z 2025-05-07T20:33:50.5392310Z if contiguous: 2025-05-07T20:33:50.5392405Z x0 = x0.contiguous() 2025-05-07T20:33:50.5392494Z x1 = x1.contiguous() 2025-05-07T20:33:50.5392571Z 2025-05-07T20:33:50.5392661Z if scale_ub is not None: 2025-05-07T20:33:50.5392774Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5392907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5392984Z ) 2025-05-07T20:33:50.5393065Z else: 2025-05-07T20:33:50.5393162Z scale_ub_tensor = None 2025-05-07T20:33:50.5393239Z 2025-05-07T20:33:50.5393370Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5393460Z op = silu_mul_quant 2025-05-07T20:33:50.5393542Z if compiled: 2025-05-07T20:33:50.5393647Z op = torch.compile(op) 2025-05-07T20:33:50.5393749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5393815Z 2025-05-07T20:33:50.5393906Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5393911Z 2025-05-07T20:33:50.5394009Z moe/activation_test.py:117: 2025-05-07T20:33:50.5394144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5394242Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5394345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5394867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5394964Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5395335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5395609Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5395961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5396056Z kernel = self.compile( 2025-05-07T20:33:50.5396455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5396629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5396755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5396760Z 2025-05-07T20:33:50.5396964Z self = 2025-05-07T20:33:50.5397773Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5398330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dd97420>} 2025-05-07T20:33:50.5399123Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5399356Z context = 2025-05-07T20:33:50.5399361Z 2025-05-07T20:33:50.5399529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5399803Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5399908Z module_map=module_map) 2025-05-07T20:33:50.5400068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5400167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5400242Z E ^ 2025-05-07T20:33:50.5400613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5400617Z 2025-05-07T20:33:50.5401089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5401099Z 2025-05-07T20:33:50.5401200Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5401432Z self=, 2025-05-07T20:33:50.5401509Z T=128, 2025-05-07T20:33:50.5401583Z D=7168, 2025-05-07T20:33:50.5401666Z scale_ub=None, 2025-05-07T20:33:50.5401750Z contiguous=True, 2025-05-07T20:33:50.5401832Z compiled=False, 2025-05-07T20:33:50.5401906Z ) 2025-05-07T20:33:50.5402125Z self = 2025-05-07T20:33:50.5402303Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5402307Z 2025-05-07T20:33:50.5402388Z @given( 2025-05-07T20:33:50.5402504Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5402606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5402719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5402837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5402949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5403018Z ) 2025-05-07T20:33:50.5403269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5403360Z def test_silu_mul_quant( 2025-05-07T20:33:50.5403429Z self, 2025-05-07T20:33:50.5403512Z T: int, 2025-05-07T20:33:50.5403584Z D: int, 2025-05-07T20:33:50.5403680Z scale_ub: Optional[float], 2025-05-07T20:33:50.5403769Z contiguous: bool, 2025-05-07T20:33:50.5403902Z compiled: bool, 2025-05-07T20:33:50.5403980Z ) -> None: 2025-05-07T20:33:50.5404076Z torch.manual_seed(2025) 2025-05-07T20:33:50.5404145Z 2025-05-07T20:33:50.5404315Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5404398Z 2025-05-07T20:33:50.5404487Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5404613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5404700Z x = x_sign * x_clamp 2025-05-07T20:33:50.5404779Z x0 = x[:, :D] 2025-05-07T20:33:50.5404861Z x1 = x[:, D:] 2025-05-07T20:33:50.5404937Z 2025-05-07T20:33:50.5405021Z if contiguous: 2025-05-07T20:33:50.5405117Z x0 = x0.contiguous() 2025-05-07T20:33:50.5405206Z x1 = x1.contiguous() 2025-05-07T20:33:50.5405281Z 2025-05-07T20:33:50.5405370Z if scale_ub is not None: 2025-05-07T20:33:50.5405474Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5405603Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5405679Z ) 2025-05-07T20:33:50.5405797Z else: 2025-05-07T20:33:50.5405899Z scale_ub_tensor = None 2025-05-07T20:33:50.5405969Z 2025-05-07T20:33:50.5406097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5406194Z op = silu_mul_quant 2025-05-07T20:33:50.5406319Z if compiled: 2025-05-07T20:33:50.5406418Z op = torch.compile(op) 2025-05-07T20:33:50.5406526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5406600Z 2025-05-07T20:33:50.5406690Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5406694Z 2025-05-07T20:33:50.5406793Z moe/activation_test.py:117: 2025-05-07T20:33:50.5406924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5407029Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5407132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5407656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5407752Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5408168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5408395Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5408754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5408849Z kernel = self.compile( 2025-05-07T20:33:50.5409253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5409430Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5409554Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5409561Z 2025-05-07T20:33:50.5409774Z self = 2025-05-07T20:33:50.5410633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5411147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cc204a0>} 2025-05-07T20:33:50.5411938Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5412132Z context = 2025-05-07T20:33:50.5412137Z 2025-05-07T20:33:50.5412313Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5412812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5412925Z module_map=module_map) 2025-05-07T20:33:50.5413089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5413191Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5413276Z E ^ 2025-05-07T20:33:50.5413646Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5413650Z 2025-05-07T20:33:50.5414080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5414087Z 2025-05-07T20:33:50.5414186Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5414412Z self=, 2025-05-07T20:33:50.5414536Z T=2048, 2025-05-07T20:33:50.5414614Z D=7168, 2025-05-07T20:33:50.5414696Z scale_ub=1200.0, 2025-05-07T20:33:50.5414835Z contiguous=True, 2025-05-07T20:33:50.5414920Z compiled=False, 2025-05-07T20:33:50.5414993Z ) 2025-05-07T20:33:50.5415226Z self = 2025-05-07T20:33:50.5415405Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5415451Z 2025-05-07T20:33:50.5415530Z @given( 2025-05-07T20:33:50.5415647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5415746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5415863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5415981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5416096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5416176Z ) 2025-05-07T20:33:50.5416427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5416524Z def test_silu_mul_quant( 2025-05-07T20:33:50.5416608Z self, 2025-05-07T20:33:50.5416686Z T: int, 2025-05-07T20:33:50.5416764Z D: int, 2025-05-07T20:33:50.5416865Z scale_ub: Optional[float], 2025-05-07T20:33:50.5416996Z contiguous: bool, 2025-05-07T20:33:50.5417088Z compiled: bool, 2025-05-07T20:33:50.5417168Z ) -> None: 2025-05-07T20:33:50.5417264Z torch.manual_seed(2025) 2025-05-07T20:33:50.5417340Z 2025-05-07T20:33:50.5417509Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5419422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5419435Z 2025-05-07T20:33:50.5419550Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5419555Z 2025-05-07T20:33:50.5419683Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5419941Z self=, 2025-05-07T20:33:50.5420016Z T=1, 2025-05-07T20:33:50.5420089Z D=5120, 2025-05-07T20:33:50.5420174Z scale_ub=1200.0, 2025-05-07T20:33:50.5420255Z contiguous=True, 2025-05-07T20:33:50.5420335Z compiled=False, 2025-05-07T20:33:50.5420417Z ) 2025-05-07T20:33:50.5420643Z self = 2025-05-07T20:33:50.5420804Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5420809Z 2025-05-07T20:33:50.5420933Z @given( 2025-05-07T20:33:50.5421054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5421154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5421267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5421392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5421504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5421584Z ) 2025-05-07T20:33:50.5421835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5421930Z def test_silu_mul_quant( 2025-05-07T20:33:50.5422010Z self, 2025-05-07T20:33:50.5422086Z T: int, 2025-05-07T20:33:50.5422164Z D: int, 2025-05-07T20:33:50.5422264Z scale_ub: Optional[float], 2025-05-07T20:33:50.5422352Z contiguous: bool, 2025-05-07T20:33:50.5422436Z compiled: bool, 2025-05-07T20:33:50.5422514Z ) -> None: 2025-05-07T20:33:50.5422605Z torch.manual_seed(2025) 2025-05-07T20:33:50.5422679Z 2025-05-07T20:33:50.5422916Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5422989Z 2025-05-07T20:33:50.5423081Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5423207Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5423296Z x = x_sign * x_clamp 2025-05-07T20:33:50.5423420Z x0 = x[:, :D] 2025-05-07T20:33:50.5423503Z x1 = x[:, D:] 2025-05-07T20:33:50.5423575Z 2025-05-07T20:33:50.5423662Z if contiguous: 2025-05-07T20:33:50.5423755Z x0 = x0.contiguous() 2025-05-07T20:33:50.5423845Z x1 = x1.contiguous() 2025-05-07T20:33:50.5423924Z 2025-05-07T20:33:50.5424014Z if scale_ub is not None: 2025-05-07T20:33:50.5424118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5424256Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5424331Z ) 2025-05-07T20:33:50.5424412Z else: 2025-05-07T20:33:50.5424512Z scale_ub_tensor = None 2025-05-07T20:33:50.5424589Z 2025-05-07T20:33:50.5424722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5424811Z op = silu_mul_quant 2025-05-07T20:33:50.5424933Z if compiled: 2025-05-07T20:33:50.5425035Z op = torch.compile(op) 2025-05-07T20:33:50.5425139Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5425210Z 2025-05-07T20:33:50.5425302Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5425306Z 2025-05-07T20:33:50.5425727Z moe/activation_test.py:117: 2025-05-07T20:33:50.5425914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5426055Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5426190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5426741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5426843Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5427217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5427447Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5427805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5427899Z kernel = self.compile( 2025-05-07T20:33:50.5428295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5428471Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5428603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5428607Z 2025-05-07T20:33:50.5428815Z self = 2025-05-07T20:33:50.5429732Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5430248Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cc21a80>} 2025-05-07T20:33:50.5431043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5431238Z context = 2025-05-07T20:33:50.5431243Z 2025-05-07T20:33:50.5431411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5431684Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5431854Z module_map=module_map) 2025-05-07T20:33:50.5432017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5432120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5432198Z E ^ 2025-05-07T20:33:50.5432565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5432633Z 2025-05-07T20:33:50.5433071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5433076Z 2025-05-07T20:33:50.5433176Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5433409Z self=, 2025-05-07T20:33:50.5433486Z T=2048, 2025-05-07T20:33:50.5433562Z D=5120, 2025-05-07T20:33:50.5433643Z scale_ub=None, 2025-05-07T20:33:50.5433724Z contiguous=True, 2025-05-07T20:33:50.5433808Z compiled=False, 2025-05-07T20:33:50.5433889Z ) 2025-05-07T20:33:50.5434115Z self = 2025-05-07T20:33:50.5434292Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5434356Z 2025-05-07T20:33:50.5434436Z @given( 2025-05-07T20:33:50.5434557Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5434661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5434776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5434893Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5435013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5435089Z ) 2025-05-07T20:33:50.5435340Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5435437Z def test_silu_mul_quant( 2025-05-07T20:33:50.5435521Z self, 2025-05-07T20:33:50.5435604Z T: int, 2025-05-07T20:33:50.5435683Z D: int, 2025-05-07T20:33:50.5435788Z scale_ub: Optional[float], 2025-05-07T20:33:50.5435880Z contiguous: bool, 2025-05-07T20:33:50.5435966Z compiled: bool, 2025-05-07T20:33:50.5436044Z ) -> None: 2025-05-07T20:33:50.5436142Z torch.manual_seed(2025) 2025-05-07T20:33:50.5436212Z 2025-05-07T20:33:50.5436381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5436463Z 2025-05-07T20:33:50.5436553Z > x_sign = torch.sign(x) 2025-05-07T20:33:50.5438471Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5438522Z 2025-05-07T20:33:50.5438637Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:50.5438642Z 2025-05-07T20:33:50.5438745Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5438978Z self=, 2025-05-07T20:33:50.5439055Z T=16384, 2025-05-07T20:33:50.5439135Z D=5120, 2025-05-07T20:33:50.5439214Z scale_ub=None, 2025-05-07T20:33:50.5439299Z contiguous=True, 2025-05-07T20:33:50.5439386Z compiled=False, 2025-05-07T20:33:50.5439463Z ) 2025-05-07T20:33:50.5439680Z self = 2025-05-07T20:33:50.5439859Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5439864Z 2025-05-07T20:33:50.5439938Z @given( 2025-05-07T20:33:50.5440055Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5440196Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5440309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5440428Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5440539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5440653Z ) 2025-05-07T20:33:50.5440911Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5441003Z def test_silu_mul_quant( 2025-05-07T20:33:50.5441077Z self, 2025-05-07T20:33:50.5441155Z T: int, 2025-05-07T20:33:50.5441227Z D: int, 2025-05-07T20:33:50.5441321Z scale_ub: Optional[float], 2025-05-07T20:33:50.5441409Z contiguous: bool, 2025-05-07T20:33:50.5441490Z compiled: bool, 2025-05-07T20:33:50.5441569Z ) -> None: 2025-05-07T20:33:50.5441662Z torch.manual_seed(2025) 2025-05-07T20:33:50.5441735Z 2025-05-07T20:33:50.5441910Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5443858Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
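The allocator hint repeated in every OOM message, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets exactly the pattern visible here: tens of MiB "reserved by PyTorch but unallocated" while only ~26 MiB is free, i.e. cached segments fragmented into pieces too small for the next 40-448 MiB request. The setting must be in the environment before the process touches CUDA; a sketch of doing that from Python (exporting it in the CI job's environment would work equally well):

    import os

    # Must happen before the first CUDA allocation in the process,
    # so set it before importing or initializing torch.cuda.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    x = torch.zeros(1, device="cuda")  # allocator now uses expandable segments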
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5443868Z 2025-05-07T20:33:50.5443992Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5443996Z 2025-05-07T20:33:50.5444097Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5444322Z self=, 2025-05-07T20:33:50.5444403Z T=4096, 2025-05-07T20:33:50.5444481Z D=5120, 2025-05-07T20:33:50.5444564Z scale_ub=None, 2025-05-07T20:33:50.5444652Z contiguous=True, 2025-05-07T20:33:50.5444735Z compiled=False, 2025-05-07T20:33:50.5444808Z ) 2025-05-07T20:33:50.5445029Z self = 2025-05-07T20:33:50.5445207Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5445211Z 2025-05-07T20:33:50.5445293Z @given( 2025-05-07T20:33:50.5445409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5445505Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5445619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5445732Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5445841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5445915Z ) 2025-05-07T20:33:50.5446161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5446304Z def test_silu_mul_quant( 2025-05-07T20:33:50.5446380Z self, 2025-05-07T20:33:50.5446455Z T: int, 2025-05-07T20:33:50.5446531Z D: int, 2025-05-07T20:33:50.5446629Z scale_ub: Optional[float], 2025-05-07T20:33:50.5446714Z contiguous: bool, 2025-05-07T20:33:50.5446800Z compiled: bool, 2025-05-07T20:33:50.5446876Z ) -> None: 2025-05-07T20:33:50.5446966Z torch.manual_seed(2025) 2025-05-07T20:33:50.5447039Z 2025-05-07T20:33:50.5447207Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5449145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5449154Z 2025-05-07T20:33:50.5449272Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5449315Z 2025-05-07T20:33:50.5449414Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5449644Z self=, 2025-05-07T20:33:50.5449716Z T=2048, 2025-05-07T20:33:50.5449797Z D=5120, 2025-05-07T20:33:50.5449879Z scale_ub=None, 2025-05-07T20:33:50.5449964Z contiguous=False, 2025-05-07T20:33:50.5450054Z compiled=False, 2025-05-07T20:33:50.5450145Z ) 2025-05-07T20:33:50.5450390Z self = 2025-05-07T20:33:50.5450568Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.5450578Z 2025-05-07T20:33:50.5450656Z @given( 2025-05-07T20:33:50.5450775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5450875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5450985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5451144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5451259Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5451332Z ) 2025-05-07T20:33:50.5451584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5451675Z def test_silu_mul_quant( 2025-05-07T20:33:50.5451750Z self, 2025-05-07T20:33:50.5451827Z T: int, 2025-05-07T20:33:50.5451902Z D: int, 2025-05-07T20:33:50.5451998Z scale_ub: Optional[float], 2025-05-07T20:33:50.5452088Z contiguous: bool, 2025-05-07T20:33:50.5452170Z compiled: bool, 2025-05-07T20:33:50.5452246Z ) -> None: 2025-05-07T20:33:50.5452343Z torch.manual_seed(2025) 2025-05-07T20:33:50.5452412Z 2025-05-07T20:33:50.5452586Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5454542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5454551Z 2025-05-07T20:33:50.5454673Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5454678Z 2025-05-07T20:33:50.5454776Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5455002Z self=, 2025-05-07T20:33:50.5455155Z T=4096, 2025-05-07T20:33:50.5455229Z D=7168, 2025-05-07T20:33:50.5455312Z scale_ub=None, 2025-05-07T20:33:50.5455399Z contiguous=True, 2025-05-07T20:33:50.5455483Z compiled=True, 2025-05-07T20:33:50.5455556Z ) 2025-05-07T20:33:50.5455779Z self = 2025-05-07T20:33:50.5455950Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.5455954Z 2025-05-07T20:33:50.5456033Z @given( 2025-05-07T20:33:50.5456145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5456244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5456358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5456472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5456584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5456663Z ) 2025-05-07T20:33:50.5456959Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5457057Z def test_silu_mul_quant( 2025-05-07T20:33:50.5457129Z self, 2025-05-07T20:33:50.5457201Z T: int, 2025-05-07T20:33:50.5457281Z D: int, 2025-05-07T20:33:50.5457377Z scale_ub: Optional[float], 2025-05-07T20:33:50.5457505Z contiguous: bool, 2025-05-07T20:33:50.5457592Z compiled: bool, 2025-05-07T20:33:50.5457667Z ) -> None: 2025-05-07T20:33:50.5457760Z torch.manual_seed(2025) 2025-05-07T20:33:50.5457833Z 2025-05-07T20:33:50.5457998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5459991Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5460000Z 2025-05-07T20:33:50.5460117Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5460121Z 2025-05-07T20:33:50.5460220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5460446Z self=, 2025-05-07T20:33:50.5460518Z T=2048, 2025-05-07T20:33:50.5460598Z D=5120, 2025-05-07T20:33:50.5460678Z scale_ub=1200.0, 2025-05-07T20:33:50.5460762Z contiguous=False, 2025-05-07T20:33:50.5460849Z compiled=False, 2025-05-07T20:33:50.5460929Z ) 2025-05-07T20:33:50.5461148Z self = 2025-05-07T20:33:50.5461331Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.5461338Z 2025-05-07T20:33:50.5461414Z @given( 2025-05-07T20:33:50.5461527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5461632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5461745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5461867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5461977Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5462049Z ) 2025-05-07T20:33:50.5462298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5462388Z def test_silu_mul_quant( 2025-05-07T20:33:50.5462461Z self, 2025-05-07T20:33:50.5462536Z T: int, 2025-05-07T20:33:50.5462608Z D: int, 2025-05-07T20:33:50.5462702Z scale_ub: Optional[float], 2025-05-07T20:33:50.5462788Z contiguous: bool, 2025-05-07T20:33:50.5462872Z compiled: bool, 2025-05-07T20:33:50.5463001Z ) -> None: 2025-05-07T20:33:50.5463094Z torch.manual_seed(2025) 2025-05-07T20:33:50.5463162Z 2025-05-07T20:33:50.5463334Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5465230Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5465238Z 2025-05-07T20:33:50.5465354Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5465358Z 2025-05-07T20:33:50.5465460Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5465724Z self=, 2025-05-07T20:33:50.5465804Z T=4096, 2025-05-07T20:33:50.5465880Z D=7168, 2025-05-07T20:33:50.5465962Z scale_ub=1200.0, 2025-05-07T20:33:50.5466050Z contiguous=True, 2025-05-07T20:33:50.5466132Z compiled=False, 2025-05-07T20:33:50.5466249Z ) 2025-05-07T20:33:50.5466470Z self = 2025-05-07T20:33:50.5466644Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5466648Z 2025-05-07T20:33:50.5466731Z @given( 2025-05-07T20:33:50.5466844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5466941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5467057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5467171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5467281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5467358Z ) 2025-05-07T20:33:50.5467611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5467705Z def test_silu_mul_quant( 2025-05-07T20:33:50.5467819Z self, 2025-05-07T20:33:50.5467892Z T: int, 2025-05-07T20:33:50.5467968Z D: int, 2025-05-07T20:33:50.5468062Z scale_ub: Optional[float], 2025-05-07T20:33:50.5468148Z contiguous: bool, 2025-05-07T20:33:50.5468233Z compiled: bool, 2025-05-07T20:33:50.5468310Z ) -> None: 2025-05-07T20:33:50.5468399Z torch.manual_seed(2025) 2025-05-07T20:33:50.5468470Z 2025-05-07T20:33:50.5468638Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5470546Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
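The figures quoted in each message ("21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated") come from the caching allocator's own counters, which can also be read directly when triaging OOMs like these. A short sketch for logging them between Hypothesis examples:

    import torch

    def log_cuda_mem(tag: str) -> None:
        # allocated = memory backing live tensors;
        # reserved = allocated plus cached free blocks held by the allocator.
        alloc = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"{tag}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

    log_cuda_mem("before example")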
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5470556Z 2025-05-07T20:33:50.5470670Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5470674Z 2025-05-07T20:33:50.5470779Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5471003Z self=, 2025-05-07T20:33:50.5471076Z T=16384, 2025-05-07T20:33:50.5471156Z D=7168, 2025-05-07T20:33:50.5471238Z scale_ub=None, 2025-05-07T20:33:50.5471323Z contiguous=False, 2025-05-07T20:33:50.5471406Z compiled=True, 2025-05-07T20:33:50.5471478Z ) 2025-05-07T20:33:50.5471699Z self = 2025-05-07T20:33:50.5471927Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:50.5471932Z 2025-05-07T20:33:50.5472008Z @given( 2025-05-07T20:33:50.5472124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5472231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5472342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5472457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5472566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5472640Z ) 2025-05-07T20:33:50.5472894Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5472987Z def test_silu_mul_quant( 2025-05-07T20:33:50.5473059Z self, 2025-05-07T20:33:50.5473135Z T: int, 2025-05-07T20:33:50.5473210Z D: int, 2025-05-07T20:33:50.5473306Z scale_ub: Optional[float], 2025-05-07T20:33:50.5473402Z contiguous: bool, 2025-05-07T20:33:50.5473485Z compiled: bool, 2025-05-07T20:33:50.5473611Z ) -> None: 2025-05-07T20:33:50.5473701Z torch.manual_seed(2025) 2025-05-07T20:33:50.5473772Z 2025-05-07T20:33:50.5473942Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5475891Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5475897Z 2025-05-07T20:33:50.5476014Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5476021Z 2025-05-07T20:33:50.5476122Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5476344Z self=, 2025-05-07T20:33:50.5476421Z T=4096, 2025-05-07T20:33:50.5476538Z D=7168, 2025-05-07T20:33:50.5476627Z scale_ub=None, 2025-05-07T20:33:50.5476719Z contiguous=True, 2025-05-07T20:33:50.5476807Z compiled=False, 2025-05-07T20:33:50.5476882Z ) 2025-05-07T20:33:50.5477107Z self = 2025-05-07T20:33:50.5477280Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5477285Z 2025-05-07T20:33:50.5477363Z @given( 2025-05-07T20:33:50.5477480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5477580Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5477697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5477816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5477933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5478009Z ) 2025-05-07T20:33:50.5478258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5478360Z def test_silu_mul_quant( 2025-05-07T20:33:50.5478439Z self, 2025-05-07T20:33:50.5478518Z T: int, 2025-05-07T20:33:50.5478599Z D: int, 2025-05-07T20:33:50.5478692Z scale_ub: Optional[float], 2025-05-07T20:33:50.5478778Z contiguous: bool, 2025-05-07T20:33:50.5478863Z compiled: bool, 2025-05-07T20:33:50.5478936Z ) -> None: 2025-05-07T20:33:50.5479025Z torch.manual_seed(2025) 2025-05-07T20:33:50.5479095Z 2025-05-07T20:33:50.5479259Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5481171Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5481226Z 2025-05-07T20:33:50.5481341Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5481345Z 2025-05-07T20:33:50.5481452Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5481683Z self=, 2025-05-07T20:33:50.5481758Z T=16384, 2025-05-07T20:33:50.5481835Z D=7168, 2025-05-07T20:33:50.5481919Z scale_ub=None, 2025-05-07T20:33:50.5482001Z contiguous=True, 2025-05-07T20:33:50.5482086Z compiled=False, 2025-05-07T20:33:50.5482163Z ) 2025-05-07T20:33:50.5482424Z self = 2025-05-07T20:33:50.5482603Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5482608Z 2025-05-07T20:33:50.5482682Z @given( 2025-05-07T20:33:50.5482797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5482938Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5483048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5483166Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5483275Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5483349Z ) 2025-05-07T20:33:50.5483599Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5483695Z def test_silu_mul_quant( 2025-05-07T20:33:50.5483772Z self, 2025-05-07T20:33:50.5483848Z T: int, 2025-05-07T20:33:50.5483922Z D: int, 2025-05-07T20:33:50.5484019Z scale_ub: Optional[float], 2025-05-07T20:33:50.5484113Z contiguous: bool, 2025-05-07T20:33:50.5484194Z compiled: bool, 2025-05-07T20:33:50.5484277Z ) -> None: 2025-05-07T20:33:50.5484367Z torch.manual_seed(2025) 2025-05-07T20:33:50.5489044Z 2025-05-07T20:33:50.5489240Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5491168Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5491177Z 2025-05-07T20:33:50.5491297Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5491305Z 2025-05-07T20:33:50.5491408Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5491640Z self=, 2025-05-07T20:33:50.5491722Z T=16384, 2025-05-07T20:33:50.5491805Z D=7168, 2025-05-07T20:33:50.5491889Z scale_ub=1200.0, 2025-05-07T20:33:50.5491974Z contiguous=True, 2025-05-07T20:33:50.5492062Z compiled=False, 2025-05-07T20:33:50.5492139Z ) 2025-05-07T20:33:50.5492359Z self = 2025-05-07T20:33:50.5492542Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5492546Z 2025-05-07T20:33:50.5492622Z @given( 2025-05-07T20:33:50.5492740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5492837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5492997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5493116Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5493227Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5493300Z ) 2025-05-07T20:33:50.5493555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5493649Z def test_silu_mul_quant( 2025-05-07T20:33:50.5493728Z self, 2025-05-07T20:33:50.5493806Z T: int, 2025-05-07T20:33:50.5493884Z D: int, 2025-05-07T20:33:50.5493983Z scale_ub: Optional[float], 2025-05-07T20:33:50.5494071Z contiguous: bool, 2025-05-07T20:33:50.5494154Z compiled: bool, 2025-05-07T20:33:50.5494234Z ) -> None: 2025-05-07T20:33:50.5494327Z torch.manual_seed(2025) 2025-05-07T20:33:50.5494398Z 2025-05-07T20:33:50.5494729Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5496690Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
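Every example above dies at the same point, the initial torch.randn allocation, with the allocator reporting only 26.44 MiB free out of 22.07 GiB. The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying that hint, assuming the suite is launched from Python; the variable is read when the CUDA caching allocator initializes, so setting it before importing torch is the safe pattern:

# Sketch: apply the allocator hint from the OOM messages above.
# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the env var is in place

# same shape as the failing T=4096, D=7168 example
x = torch.randn([4096, 2 * 7168], device="cuda", dtype=torch.bfloat16)

In a CI job the equivalent is exporting the variable in the step environment before the pytest invocation.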
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1c9487c0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
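This CompilationError is architectural rather than transient: Triton rejects fp8e4nv outright on this GPU and lists only fp8e4b15 and fp8e5 as supported. A hedged sketch of a capability guard such a test could use, assuming (as is generally the case for Triton's fp8e4nv, i.e. float8_e4m3) that the dtype needs compute capability 8.9 or newer; the helper and class names are illustrative, not FBGEMM's actual gating:

# Sketch: skip fp8e4nv tests on GPUs that cannot compile them.
# Assumption: fp8e4nv requires SM 8.9+ (Ada/Hopper); verify against
# the Triton version actually deployed.
import unittest
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
class SiluMulQuantTest(unittest.TestCase):  # hypothetical name
    ...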
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    (test source identical to the T=128, D=5120 example above)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
    (remaining frames identical to the compilation failure above, ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"))

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
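Note how the allocator's headroom shrinks as the run progresses: the earlier failures reported 26.44 MiB free, the later ones only 4.44 MiB, and even a 20 MiB request now fails. That pattern is consistent with tensors from previous Hypothesis examples never being released between examples. A minimal sketch of an explicit release hook, assuming the test class can call it from setUp()/tearDown() or at the top of the test body:

# Sketch: free cached CUDA memory between Hypothesis examples.
# gc.collect() drops tensors that are no longer referenced;
# empty_cache() returns the freed blocks to the driver so the
# next example starts from a clean pool.
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()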
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (remaining figures identical to the error above) See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
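The three DeprecationWarnings come from Triton's autotuner: warmup, rep, and use_cuda_graph are no longer meaningful arguments (see the linked PR). Assuming the warning is triggered by passing them explicitly to triton.autotune, the fix is simply to stop passing them; a sketch with an illustrative kernel, not one of FBGEMM's:

# Sketch: an autotuned Triton kernel without the deprecated
# warmup=/rep=/use_cuda_graph= arguments.
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n"],  # deprecated benchmarking knobs simply omitted
)
@triton.jit
def _copy_kernel(dst, src, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst + offs, tl.load(src + offs, mask=mask), mask=mask)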
See " 2025-05-07T20:33:50.5547842Z 2025-05-07T20:33:50.5548061Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:50.5548232Z ================= 1 failed, 1 deselected, 3 warnings in 13.30s ================= 2025-05-07T20:33:52.2320817Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:52.2976542Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:52.2977024Z 2025-05-07T20:33:52.2977367Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:52.2978537Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:52.2979369Z 2025-05-07T20:33:52.2979378Z 2025-05-07T20:33:52.2979385Z 2025-05-07T20:33:52.2995155Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:52.3079048Z Post job cleanup. 2025-05-07T20:33:52.4064281Z [command]/usr/bin/git version 2025-05-07T20:33:52.4109069Z git version 2.47.1 2025-05-07T20:33:52.4149356Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/e2f51991-c98e-412c-96de-984594d25122/.gitconfig' 2025-05-07T20:33:52.4161768Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e2f51991-c98e-412c-96de-984594d25122' before making global git config changes 2025-05-07T20:33:52.4162661Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:52.4167701Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:52.4213676Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:52.4249457Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:52.4590945Z Entering 'external/asmjit' 2025-05-07T20:33:52.4658129Z Entering 'external/composable_kernel' 2025-05-07T20:33:52.4730538Z Entering 'external/cpuinfo' 2025-05-07T20:33:52.4796396Z Entering 'external/cutlass' 2025-05-07T20:33:52.4872026Z Entering 'external/googletest' 2025-05-07T20:33:52.4939139Z Entering 'external/hipify_torch' 2025-05-07T20:33:52.5006267Z Entering 'external/json' 2025-05-07T20:33:52.5093203Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:52.5115990Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5126817Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:52.5159277Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:52.5493711Z Entering 'external/asmjit' 2025-05-07T20:33:52.5536369Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5579167Z Entering 'external/composable_kernel' 2025-05-07T20:33:52.5622244Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5673019Z Entering 'external/cpuinfo' 2025-05-07T20:33:52.5715701Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5759318Z Entering 'external/cutlass' 2025-05-07T20:33:52.5808990Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5861097Z 
Post job cleanup.
[command]/usr/bin/git version
git version 2.47.1
Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/e2f51991-c98e-412c-96de-984594d25122/.gitconfig'
Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e2f51991-c98e-412c-96de-984594d25122' before making global git config changes
Adding repository directory to the temporary git global config as a safe directory
[command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
[command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
[command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
Entering 'external/asmjit'
Entering 'external/composable_kernel'
Entering 'external/cpuinfo'
Entering 'external/cutlass'
Entering 'external/googletest'
Entering 'external/hipify_torch'
Entering 'external/json'
[command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
http.https://github.com/.extraheader
[command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
[command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
Entering 'external/asmjit'
http.https://github.com/.extraheader
Entering 'external/composable_kernel'
http.https://github.com/.extraheader
Entering 'external/cpuinfo'
http.https://github.com/.extraheader
Entering 'external/cutlass'
http.https://github.com/.extraheader
Entering 'external/googletest'
http.https://github.com/.extraheader
Entering 'external/hipify_torch'
http.https://github.com/.extraheader
Entering 'external/json'
http.https://github.com/.extraheader
A job completed hook has been configured by the self-hosted runner administrator
##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
##[endgroup]
[!ALERT!] Swap in detected! [!ALERT!]
[!ALERT!] Swap out detected [!ALERT!]
Cleaning up orphan processes